Abstract. Often one has a preference order among the different systems 
that satisfy a given specification. Under a probabilistic assumption about 
the possible inputs, such a preference order is naturally expressed by a 
weighted automaton, which assigns to each word a value, such that a system 
is preferred if it generates a higher expected value. We solve the following 
optimal-synthesis problem: given an omega-regular specification, a Markov 
chain that describes the distribution of inputs, and a weighted automaton 
that measures how well a system satisfies the given specification under the 
given input assumption, synthesize a system that optimizes the measured 
value. 

For safety specifications and measures that are defined by mean-payoff 
automata, the optimal-synthesis problem amounts to finding a strategy 
in a Markov decision process (MDP) that is optimal for a long-run aver- 
age reward objective, which can be done in polynomial time. For general 
omega-regular specifications, the solution rests on a new, polynomial-time 
algorithm for computing optimal strategies in MDPs with mean-payoff 
parity objectives. Our algorithm generates optimal strategies consisting of 
two memoryless strategies and a counter. This counter is in general not 
bounded. To obtain a finite-state system, we show how to construct an 
$\varepsilon$-optimal strategy with a bounded counter for any $\varepsilon > 0$. Furthermore, 
we show how to decide in polynomial time if we can construct an optimal 
finite-state system (i.e., a system without a counter) for a given specifica- 
tion. 

We have implemented our approach and the underlying algorithms in a tool 
that takes qualitative and quantitative specifications and automatically 
constructs a system that satisfies the qualitative specification and opti- 
mizes the quantitative specification, if such a system exists. We present 
some experimental results showing optimal systems that were automati- 
cally generated in this way. 



1 Introduction 



Building correct and reliable programs is one of the key challenges in computer 
science. Automatic verification and synthesis aim to address this problem by 



defining correctness with respect to a formal specification, a mathematical de- 
scription of the desired behavior of the system. In automatic verification, we ask 
if a given system satisfies a given specification [18,43,20]. The synthesis problem 
asks to automatically derive a system from a specification [17,44,41]. Traditionally, 
the verification and synthesis problems are studied with respect to Boolean 
specifications in an adversarial environment: the Boolean (or qualitative) specifi- 
cation maps each possible behavior of a system to true or false indicating if this 
behavior is a desired behavior or not. Analyzing a system in an adversarial en- 
vironment corresponds to considering the system under the worst-case behavior 
of the environment. In this work we study the verification and synthesis prob- 
lem for quantitative objectives in probabilistic environments, which corresponds 
to analyzing the system under the average-case behavior of its environment. 

Quantitative reasoning is traditionally used to measure quantitative properties 
of systems, such as performance or reliability (cf. [1,34,4,38]). Quantitative rea- 
soning has also been shown useful in the classically Boolean contexts of verification 
and synthesis [6,35]. In particular, by augmenting a Boolean specification with a 
quantitative specification, we can measure how "well" a system satisfies the 
specification. For example, among systems that respond to requests, we may prefer one 
system over another if it responds quicker, or it responds to more requests, or it 
issues fewer unrequested responses, etc. In synthesis, we can use such measures to 
guide the synthesis process towards deriving a system that is, in the desired sense, 
"optimal" among all systems that satisfy the specification [6]. 

There are many ways to define a quantitative measure that captures the "good- 
ness" of a system with respect to the Boolean specification, and particular mea- 
sures can be quite different, but there are two questions every such measure has 
to answer: (1) how to assign a quantitative value to one particular behavior of 
a system (measure along a behavior) and (2) how to aggregate the quantitative 
values that are assigned to the possible behaviors of the system (measure across 
behaviors). Recall the response property. Suppose there is a sequence of requests 
along a behavior and we are interested primarily in response time, i.e., the quicker 
the system responds, the better. As measure (1) along a particular behavior, we 
may be interested in an average or the supremum (i.e., worst case) of all response 
times, or in any other function that aggregates all response times along a behavior 
into a single real value. The choice of measure (2) across behaviors is independent: 
we may be interested in an average of all values assigned to individual behaviors, 
or in the supremum, or again, in some other function. In this way, we can choose 
to measure the average (across behaviors) of average (along a behavior) response 
times, or the average of worst-case response times, or the worst case of average 
response times, or the worst case of worst-case response times, etc. Note that these 
are the same two choices that appear in weighted automata and max-plus algebras 
(cf. [29,32,21]). 

In previous work, we studied various measures (1) along a behavior. In partic- 
ular, lexicographically ordered tuples of averages [6] and ratios [7] are of natural 
interest in certain contexts. Alur et al. [2] consider an automaton model with 
a quantitative measure (1) that is defined with respect to certain accumulation 
points along a behavior. However, in all of these cases, for measure (2) only the 
worst case (i.e., supremum) is considered. This comes naturally as an extension of 
Boolean thinking, where a system fails to satisfy a property if even a single be- 
havior violates the property. But in this way, we cannot distinguish between two 
systems that have the same worst cases across behaviors, but in one system al- 
most all possible behaviors exhibit the worst case, while in the other only very few 
behaviors do so. In contrast, in performance evaluation one usually considers the 
average case across different behaviors. 

For instance, consider a resource controller for two clients. Clients send re- 
quests, and the controller grants the resource to one of them at a time. Suppose 
we prefer, again, systems where requests are granted "as quickly as possible." Ev- 
ery controller that avoids simultaneous grants will have a behavior along which 
at least one grant is delayed by one step, namely, the behavior along which both 
clients continuously send requests. The best the controller can do is to alternate 
between the clients. However, if systems are measured with respect to the worst 
case across different behaviors, then a controller that always alternates between 
both clients, independent of the actual requests, is as good as a controller that 
tries to grant all requests immediately and only alternates when both clients re- 
quest the resource at the same time. Clearly, if we wish to synthesize the preferred 
controller, we need to apply an average-case measure across behaviors. 

In this paper, we present a measure (2) that averages across all possible behav- 
iors of a system and solve the corresponding synthesis problem to derive an opti- 
mal system. In synthesis, the different possible behaviors of a system are caused by 
different input sequences. Therefore, in order to take a meaningful average across 
different behaviors, we need to assume a probability distribution over the possible 
input sequences. For example, if on input 0 a system has response time $r_0$, and 
on input 1 response time $r_1$, and input 0 is twice as likely as input 1, then the 
average response time is $(2 r_0 + r_1)/3$. 

The resulting synthesis problem is as follows: given a Boolean specification $\varphi$, 
a probabilistic input assumption $\mu$, and a measure that assigns to each system $M$ 
a value $V_\varphi^\mu(M)$ of how "well" $M$ satisfies $\varphi$ under $\mu$, construct a system $M$ such 
that $V_\varphi^\mu(M) \ge V_\varphi^\mu(M')$ for all $M'$. We solve this problem for qualitative 
specifications that are given as $\omega$-automata, input assumptions that are given as finite 
Markov chains, and quantitative specifications given as mean-payoff automata, 
which define a quantitative language by assigning values to behaviors. From the 
above three inputs we derive a measure that captures (1) an average along system 
behaviors as well as (2) an average across system behaviors; and thus we obtain a 
measure that induces a value for each system. 

To our knowledge this is the first solution of a synthesis problem for an average- 
case measure across system behaviors. Technically the solution rests on a new, 
polynomial-time algorithm for computing optimal strategies in MDPs with mean-payoff 
parity objectives. In contrast to MDPs with mean-payoff objectives, where 
pure memoryless optimal strategies exist, optimal strategies for mean-payoff par- 
ity objectives in MDPs require infinite memory. It follows from our result that the 
infinite memory can be captured with a counter, and with this insight we develop 
the polynomial time algorithm for solving MDPs with mean-payoff parity objec- 
tives. A careful analysis of the constructed strategies allows us to construct, for any 
$\varepsilon > 0$, a finite-state system that is within $\varepsilon$ of the optimal value. Furthermore, we 
present a polynomial-time procedure to decide if there exists a finite-state system 
(system without a counter) that achieves the optimal value for a mean-payoff par- 
ity specification. We show that for MDPs with mean-payoff parity objectives finite 
memory does not help, i.e., either the optimal strategy requires infinite memory or 
there exists a memoryless strategy that also achieves the optimal value. We give 
a linear program to check if there exists a memoryless strategy that is optimal. 

Related work. Many formalisms for quantitative specifications have been considered 
in the literature [2, 8-11, 23, 24, 27, 28, 37]; most of these works (other than 
[2, 11, 23]) do not consider mean-payoff specifications and none of these works focus 
on how quantitative specifications can be used to obtain better implementations 
for the synthesis problem. Furthermore, several notions of metrics for probabilis- 
tic systems and games have been proposed in the literature [25, 26]; these metrics 
provide a measure that indicates how close two systems are with respect to all 
temporal properties expressible in a logic; whereas our work uses quantitative 
specification to compare systems with respect to the property of interest. Similar 
in spirit but based on a completely different technique, is the work by Niebert et 
al. [39], who group behaviors into good and bad with respect to satisfying a given 
LTL specification and use a CTL*-like analysis specification to quantify over the 
good and bad behaviors. This measure of logical properties was used by Katz and 
Peled [35] to guide genetic algorithms to discover counterexamples and corrections 
for distributed protocols. Control and synthesis in the presence of uncertainty has 
been considered in several works such as [3, 19, 5]: in all these works, the framework 
consists of MDPs to model nondeterministic and probabilistic behavior, and the 
specification is a Boolean specification. In contrast to these works, where the 
probabilistic choice represents uncertainty, in our work the probabilistic choice represents 
a model for the environment assumption on the input sequences that allows us 
to consider the system as a whole. Moreover, we consider quantitative objectives. 
Parr and Russell [40] also synthesize strategies for MDPs that optimize a 
quantitative objective. They optimize with respect to the expected discounted total 
reward, while we consider mean-payoff objectives. Furthermore, we allow the user 
(i) to provide additional qualitative (in particular liveness) constraints and (ii) to 
specify the qualitative and the quantitative constraints independently of the MDP. 
MDPs with mean-payoff objectives are well studied. The books [30, 42] present a 
detailed analysis of this topic. We present a solution to a more general condition: 
the Boolean combination of mean-payoff and parity conditions on MDPs. We show 
that MDPs with mean-payoff parity objectives can be solved in polynomial time. 

Structure of the paper. Section 2 gives the necessary theoretical background 
and fixes the notation. In Section 3 we introduce the problem of measuring systems 
with respect to quantitative specifications using several examples, define our new 
measure, and show how to compute the value of a system with respect to this 
measure. In Section 4 we show how to construct a system that satisfies a qualitative 
specification and optimizes a quantitative specification with respect to our new 
measure. In Section 5 we present experimental results and we conclude in Section 6. 
This paper is an extended and improved version of [16] that includes new 
theoretical results, more examples, detailed proofs, and reports on an improved 
implementation. We present new theoretical results related to finite-state strategies 
for approximating the values in mean-payoff parity MDPs and a polynomial-time 
procedure to decide the existence of a memoryless strategy that achieves the optimal 
value. 

2 Preliminaries 

2.1 Alphabet, Words, and Languages 

An alphabet $\Sigma$ consists of a finite set of letters $\sigma \in \Sigma$. We often use letters 
representing assignments to a set of Boolean variables $V$. In this case we write 
$\Sigma = 2^V$, i.e., $\Sigma$ is the set of all subsets of $V$, and a letter $\sigma = \{v_1, \dots, v_n\} \in 2^V$ 
encodes the unique assignment in which all variables in $\sigma$ are set to true and 
all other variables are set to false. A word $w$ over $\Sigma$ is either a finite or infinite 
sequence of letters, i.e., $w \in \Sigma^* \cup \Sigma^\omega$. Given a word $w \in \Sigma^\omega$, we denote by $w_i$ the 
letter at position $i$ of $w$ and by $w^i$ the prefix of $w$ of length $i$, i.e., $w^i = w_1 w_2 \dots w_i$. 
We denote by $|w|$ the length of the word $w$, i.e., $|w^i| = i$ and $|w| = \infty$ if $w$ is 
infinite. A qualitative language $L$ is a subset of $\Sigma^\omega$. A quantitative language $L$ [11] 
is a mapping from the set of words to the set of reals, i.e., $L : \Sigma^\omega \to \mathbb{R}$. Note that 
the characteristic function of a qualitative language $L$ is a quantitative language 
mapping words to 0 and 1. Given a qualitative language $L$, we use $L$ also to denote 
its characteristic function. 

2.2 Automata with Parity, Safety, and Mean-Payoff Objectives 

A (finite-state) automaton is a tuple $A = (\Sigma, Q, q_0, \Delta)$, where $\Sigma$ is an alphabet, 
$Q$ is a (finite) set of states, $q_0 \in Q$ is an initial state, and $\Delta : Q \times \Sigma \to Q$ ⁵ is a 
transition function that maps a state and a letter to a successor state. The run 
of $A$ on a word $w = w_0 w_1 \dots$ is a sequence of states $\rho = \rho_0 \rho_1 \dots$ such that (i) 
$\rho_0 = q_0$ and (ii) for all $0 \le i < |w|$, $\Delta(\rho_i, w_i) = \rho_{i+1}$. 

A parity automaton is a tuple $A = ((\Sigma, Q, q_0, \Delta), p)$, where $(\Sigma, Q, q_0, \Delta)$ is a 
finite-state automaton and $p : Q \to \{0, 1, \dots, d\}$ is a priority function that maps 
every state to a natural number in $[0, d]$ called priority. A parity automaton $A$ 
accepts a word $w$ if the least priority of all states occurring infinitely often in the 
run $\rho$ of $A$ on $w$ is even, i.e., $\min_{q \in \mathrm{Inf}(\rho)} p(q)$ is even, where 
$\mathrm{Inf}(\rho) = \{q \mid \forall i\, \exists j > i : \rho_j = q\}$. The language of $A$, denoted by $L_A$, is the set of all words accepted 
by $A$. A safety automaton is a parity automaton with only priorities 0 and 1, 
and no transitions from priority-1 to priority-0 states. A mean-payoff automaton 
is a tuple $A = ((\Sigma, Q, q_0, \Delta), r)$, where $(\Sigma, Q, q_0, \Delta)$ is a finite-state automaton 
and $r : Q \times \Sigma \to \mathbb{N}$ is a reward function that associates to each transition of the 
automaton a reward $v \in \mathbb{N}$. A mean-payoff automaton assigns to each word $w$ the 

5 Note that our automata are deterministic and complete to simplify the presentation. 
Fig. 1. Mean-payoff automaton A    Fig. 2. Finite-state system M 

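long-run average of the rewards, i.e., for a word $w$ let $\rho$ be the run of $A$ on $w$, then 
we have 

$$L_A(w) = \begin{cases} \frac{1}{|w|} \sum_{0 \le i < |w|} r(\rho_i, w_i) & \text{if } w \text{ is finite,} \\ \liminf_{n \to \infty} L_A(w^n) & \text{otherwise.} \end{cases}$$ 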

Note that $L_A$ is a function assigning values to words. 

Example 1. Figure 1 shows a mean-payoff automaton $A = ((\Sigma, Q, q_0, \Delta), r)$ for 
words over the alphabet $\Sigma = 2^{\{r, g\}} = \{\{\}, \{r\}, \{g\}, \{r, g\}\}$, which are all possible 
assignments to the two Boolean variables $r$ and $g$. E.g., the letter $\{r\}$ means 
that variable $r$ is true and all the other variables (in this case only $g$) are false. 
The automaton has two states $q_0$ and $q_1$ represented by circles. State $q_0$ is the 
initial state, which is indicated by the straight arrow from the left. Transitions are 
represented by directed arrows. They are labeled with (i) a conjunction of literals 
representing a set of letters and (ii) in parentheses, the reward obtained when 
following this transition. If a variable $v$ appears in positive form in a label, then 
we can take this transition only with a letter that includes $v$. If the variable $v$ 
appears in negated form (i.e., $\bar{v}$), then this transition can only be taken with letters 
that do not include $v$. Note that transitions depend only on the signals that appear 
in their labels. E.g., the self-loop on state $q_0$ labeled with $g(1)$ means that we can 
move from $q_0$ to $q_0$ with any letter that includes $g$, i.e., either with letter $\{g\}$ or 
with letter $\{r, g\}$. The automaton assigns to each word in $\Sigma^\omega$ the average reward. 
E.g., the run of $A$ on the word $(\{r\}\{r\}\{r, g\})^\omega$ is $(q_0 q_0 q_1)^\omega$ and the corresponding 
sequence of rewards is $(0\, 0\, 1)^\omega$ with an average reward of $(0 + 0 + 1)/3 = 1/3$. 
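
The computation in Example 1 can be sketched in a few lines of Python. Since Figure 1 is not reproduced here, the transition and reward function `step` below is a hypothetical request/grant automaton chosen only so that it agrees with the value 1/3 quoted in the example; it is an assumption, not a transcription of the figure. For an ultimately periodic word the long-run average equals the average reward over the part of the run that repeats, which is what `value_lasso` computes.

```python
from fractions import Fraction

def step(state, letter):
    """Hypothetical transition/reward function (an assumption, not Fig. 1)."""
    if "g" in letter:                      # a grant always yields reward 1
        return "q0", 1
    if state == "q1" or "r" in letter:     # an open request is waiting
        return "q1", 0
    return "q0", 1                         # nothing requested, nothing owed

def value_finite(word, state="q0"):
    """Average reward of a non-empty finite word."""
    total = 0
    for letter in word:
        state, reward = step(state, letter)
        total += reward
    return Fraction(total, len(word))

def value_lasso(cycle, state="q0"):
    """Long-run average reward of the infinite word cycle^omega."""
    seen = {}      # automaton state at the start of a pass -> pass index
    sums = []      # reward collected during each pass through `cycle`
    while state not in seen:
        seen[state] = len(sums)
        total = 0
        for letter in cycle:
            state, reward = step(state, letter)
            total += reward
        sums.append(total)
    loop = sums[seen[state]:]              # passes that repeat forever
    return Fraction(sum(loop), len(loop) * len(cycle))

word = [{"r"}, {"r"}, {"r", "g"}]
print(value_lasso(word))
```

Exact rationals (`Fraction`) are used so that the result can be compared directly with the value stated in the text.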



2.3 State machines and Specifications 

A (finite-)state machine (or system) with input signals $I$ and output signals $O$ 
is a tuple $M = (Q, q_0, \Delta, \lambda)$, where $(\Sigma_I, Q, q_0, \Delta)$ with $\Sigma_I = 2^I$ is a (finite-state) 
automaton and $\lambda : Q \times \Sigma_I \to \Sigma_O$ with $\Sigma_O = 2^O$ is a labeling function 
that maps every transition in $\Delta$ to an element in $\Sigma_O$. The sets $\Sigma_I$ and $\Sigma_O$ are 
called the input and the output alphabet of $M$, respectively. We denote the joint 
alphabet $2^{I \cup O}$ by $\Sigma$. 

Given an input word $w \in \Sigma_I^* \cup \Sigma_I^\omega$, let $\rho$ be the run of $M$ on $w$; the outcome 
of $M$ on $w$, denoted by $O_M(w)$, is the word $v \in \Sigma^* \cup \Sigma^\omega$ s.t. for all $0 \le i < |w|$, 
$v_i = w_i \cup \lambda(\rho_i, w_i)$. Note that $O_M$ is the function mapping input words to outcomes. 
The language of $M$, denoted by $L_M$, is the set of outcomes of $M$ on all infinite input 
words. 

Example 2. Consider the system $M$ depicted in Figure 2. System $M$ has one 
Boolean input variable $r$ and one Boolean output variable $g$. In every step, $M$ 
reads the value of the variable $r$ and sets the value of the variable $g$. More 
precisely, $M$ sets $g$ to false whenever either $r$ is false in the current step or $g$ has 
been true in the previous step. The input alphabet of $M$ is $2^{\{r\}} = \{\{\}, \{r\}\}$ and 
its output alphabet is $2^{\{g\}} = \{\{\}, \{g\}\}$. Recall that all variables that are absent in 
a letter are set to false, e.g., the input letter $\{\}$ means that the value of $r$ is false, 
while $\{r\}$ refers to $r$ being true. We again label edges with conjunctions of literals. 
The conjunction on the left of the slash describes a set of input letters, i.e., a set 
of assignments to the input variables. The conjunction on the right describes a 
single output letter, which corresponds to an assignment of the output variables. 
E.g., the transition from state $q_1$ to state $q_0$ labeled $/\bar{g}$ means that if the system 
is in state $q_1$, then it moves to the state $q_0$ and sets the variable $g$ to false for any 
input letter because the conjunction for the input variables is empty. 

Consider the input word $w = \{r\}\{r\}\{\}\{r\}$. The outcome of $M$ on $w$ is the 
combined word $\{r, g\}\{r\}\{\}\{r, g\}$. The language of $M$ consists of all the infinite words 
generated by arbitrarily concatenating the following three words: (i) $w_1 = \{\}$, (ii) 
$w_2 = \{r, g\}\{r\}$, and (iii) $w_3 = \{r, g\}\{\}$, i.e., $L_M = (w_1 | w_2 | w_3)^\omega$. 
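
As a small illustration of the outcome function, the following Python sketch encodes the behavior described in Example 2: the grant $g$ is raised exactly when $r$ is true in the current step and $g$ was false in the previous step. The state names `q0` and `q1` follow the text; representing letters as Python sets is an assumption made for this sketch.

```python
def outcome(word):
    """Outcome of the system M of Example 2 on a finite input word."""
    state = "q0"   # q0: g was false in the previous step, q1: g was true
    result = []
    for letter in word:
        grant = state == "q0" and "r" in letter
        # The outcome letter is the input letter joined with the output letter.
        result.append(frozenset(letter) | ({"g"} if grant else set()))
        state = "q1" if grant else "q0"
    return result

w = [{"r"}, {"r"}, set(), {"r"}]
print([sorted(letter) for letter in outcome(w)])
```

Running the sketch on the input word from the example yields the letters $\{r, g\}$, $\{r\}$, $\{\}$, $\{r, g\}$, in agreement with the outcome stated above.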

We analyze state machines with respect to qualitative and quantitative 
specifications. Qualitative specifications are qualitative languages, i.e., subsets of $\Sigma^\omega$ 
or equivalently functions mapping words to 0 and 1. We consider $\omega$-regular 
specifications given as safety or parity automata. Given a safety or parity automaton 
$A$ and a state machine $M$, we say $M$ satisfies $L_A$ (written $M \models L_A$) if $L_M \subseteq L_A$ 
or equivalently $\forall w \in \Sigma_I^\omega : L_A(O_M(w)) = 1$. A quantitative specification is given 
by a quantitative language $L$, i.e., a function that assigns values to words. Given 
a state machine $M$, we use function composition to relate $L$ and $M$, i.e., $L \circ O_M$ 
maps every input word $w$ of $M$ to the value assigned by $L$ to the outcome of 
$M$ on $w$. We consider quantitative specifications given by mean-payoff automata. 

2.4 Markov Chains and Markov Decision Processes (MDP) 

A probability distribution over a finite set $S$ is a function $d : S \to [0, 1]$ such that 
$\sum_{s \in S} d(s) = 1$. We denote the set of all probability distributions over $S$ by $\mathcal{D}(S)$. 
A Markov Decision Process (MDP) $G = (S, s_0, E, S_1, S_P, \delta)$ consists of a finite set 
of states $S$, an initial state $s_0 \in S$, a set of edges $E \subseteq S \times S$, a partition $(S_1, S_P)$ of 
the set $S$, and a probabilistic transition function $\delta : S_P \to \mathcal{D}(S)$. The states in $S_1$ 
are the player-1 states, where player 1 decides the successor state; and the states 
in $S_P$ are the probabilistic states, where the successor state is chosen according 
to the probabilistic transition function $\delta$. So, we can view an MDP as a game 
between two players: player 1 and a random player that plays according to $\delta$. We 
assume that for $s \in S_P$ and $t \in S$, we have $(s, t) \in E$ iff $\delta(s)(t) > 0$, and we often 
write $\delta(s, t)$ for $\delta(s)(t)$. For technical convenience we assume that every state has 
at least one outgoing edge. For a state $s \in S$, we write $E(s)$ to denote the set 
$\{t \in S \mid (s, t) \in E\}$ of possible successors. If the set $S_1 = \emptyset$, then $G$ is called a 
Markov chain and we omit the partition $(S_1, S_P)$ from the definition. A $\Sigma$-labeled 
MDP $(G, \lambda)$ is an MDP $G$ with a labeling function $\lambda : S \to \Sigma$ assigning to each 
state of $G$ a letter from $\Sigma$. We assume that labeled MDPs are deterministic and 
complete, i.e., (i) $\forall (s, s'), (s, s'') \in E$, $\lambda(s') = \lambda(s'') \rightarrow s' = s''$ holds, and (ii) 
$\forall s \in S, \sigma \in \Sigma$, $\exists s' \in S$ s.t. $(s, s') \in E$ and $\lambda(s') = \sigma$. 
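
The definition of an MDP can be mirrored by a small data structure. The following Python sketch is one possible encoding (the class name `MDP`, its field names, and the example instance are assumptions made for illustration); `is_well_formed` checks the three assumptions stated above: $(S_1, S_P)$ partitions $S$, every state has an outgoing edge, and for probabilistic states the edges are exactly the successors with positive probability.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: set
    initial: str
    edges: set          # E, a set of (s, t) pairs
    player1: set        # S_1
    probabilistic: set  # S_P
    delta: dict         # maps each s in S_P to a dict {t: probability}

    def successors(self, s):
        """E(s): the possible successors of s."""
        return {t for (u, t) in self.edges if u == s}

    def is_well_formed(self):
        """Check the assumptions made on MDPs in this section."""
        partition = (self.player1 | self.probabilistic == self.states
                     and not (self.player1 & self.probabilistic))
        total = all(self.successors(s) for s in self.states)
        consistent = all(
            abs(sum(self.delta[s].values()) - 1) < 1e-9
            and {t for t, p in self.delta[s].items() if p > 0}
                == self.successors(s)
            for s in self.probabilistic
        )
        return partition and total and consistent

# Hypothetical three-state example: "a" is a player-1 state.
example = MDP(
    states={"a", "b", "c"}, initial="a",
    edges={("a", "b"), ("a", "c"), ("b", "a"), ("b", "c"), ("c", "c")},
    player1={"a"}, probabilistic={"b", "c"},
    delta={"b": {"a": 0.5, "c": 0.5}, "c": {"c": 1.0}},
)
print(example.is_well_formed())
```

With `player1` empty the same structure describes a Markov chain.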

2.5 Plays and strategies 

An infinite path, or a play, of the MDP $G$ is an infinite sequence $\omega = s_0 s_1 s_2 \dots$ of 
states such that $(s_k, s_{k+1}) \in E$ for all $k \in \mathbb{N}$. Note that we use $\omega$ only to denote 
plays, i.e., infinite sequences of states. We use $v$ to refer to finite sequences of states. 
We write $\Omega$ for the set of all plays, and for a state $s \in S$, we write $\Omega_s \subseteq \Omega$ for the 
set of plays starting at $s$. A strategy for player 1 is a function $\pi : S^* S_1 \to \mathcal{D}(S)$ that 
assigns a probability distribution to all finite sequences $v \in S^* S_1$ of states ending 
in a player-1 state. Player 1 follows $\pi$ if she makes all her moves according to the 
distributions provided by $\pi$. A strategy must prescribe only available moves, i.e., 
for all $v \in S^*$, $s \in S_1$, and $t \in S$, if $\pi(vs)(t) > 0$, then $(s, t) \in E$. We denote 
by $\Pi$ the set of all strategies for player 1. Once a starting state $s \in S$ and a 
strategy $\pi \in \Pi$ are fixed, the outcome of the game is a random walk $\omega_s^\pi$ for which 
the probabilities of every event $A \subseteq \Omega$, which is a measurable set of plays, are 
uniquely defined. 

For a state $s \in S$ and an event $A \subseteq \Omega$, we write $\mu_s^\pi(A)$ for the probability that 
a play belongs to $A$ if the game starts from the state $s$ and player 1 follows the 
strategy $\pi$. For a measurable function $f : \Omega \to \mathbb{R}$ we denote by $\mathbb{E}_s^\pi[f]$ 
the expectation of the function $f$ under the probability measure $\mu_s^\pi(\cdot)$. 

Strategies that do not use randomization are called pure. A player-1 strategy $\pi$ 
is pure if for all $v \in S^*$ and $s \in S_1$, there is a state $t \in S$ such that $\pi(vs)(t) = 1$. 
A memoryless player-1 strategy depends only on the current state, i.e., for all 
$v, v' \in S^*$ and for all $s \in S_1$ we have $\pi(vs) = \pi(v's)$. A memoryless strategy can be 
represented as a function $\pi : S_1 \to \mathcal{D}(S)$. A pure memoryless strategy is a strategy 
that is both pure and memoryless. A pure memoryless strategy can be represented 
as a function $\pi : S_1 \to S$. A pure finite-state strategy is a strategy that can be 
represented by a finite-state machine $M = (Q, q_0, \Delta, \lambda)$ with input alphabet $\Sigma_I = S$ 
and output alphabet $\Sigma_O = S$. The set $Q$ represents a set of memory locations 
with $q_0$ as the initial memory content. The transition function $\Delta : Q \times S \to Q$ 
describes how to update the memory while moving to the next state in the MDP. 
The labeling function $\lambda : Q \times S \to S$ defines the moves of player 1, i.e., for every 
memory location and state of the MDP, it provides a successor state in the MDP. 

2.6 Resulting Markov chains, recurrence classes, unichain, and 
multichain 

Given an MDP $G$ and a pure memoryless or finite-state strategy $\pi$, if we restrict $G$ 
to follow the actions suggested in $\pi$, we obtain a Markov chain. 
Given a Markov chain $G = (S, s_0, E, \delta)$, a state $s \in S$ is called recurrent ⁶ if the 
expected number of visits to $s$ is infinite. Otherwise, the state $s$ is called transient. 
A maximal set of recurrent states that is closed ⁷ under $E$ is called a recurrence class. 
A Markov chain $G$ is unichain if it has a single recurrence class. Otherwise, $G$ is 
called multichain. 
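
For a finite Markov chain, whether a state is recurrent depends only on the edge relation: a state $s$ is recurrent exactly if $s$ can be reached again from every state reachable from $s$, so the recurrence classes are the bottom strongly connected components of the graph $(S, E)$. The following Python sketch uses this standard fact; the function names and the adjacency-dictionary encoding of $E$ are assumptions made for illustration, and the quadratic reachability computation is chosen for brevity rather than efficiency.

```python
def reachable(edges, s):
    """All states reachable from s (including s) in the graph `edges`."""
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for v in edges[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def recurrence_classes(edges):
    """Recurrence classes of a finite Markov chain given as {state: successors}."""
    reach = {s: reachable(edges, s) for s in edges}
    classes = set()
    for s in edges:
        # s is recurrent iff it is reachable from everything it can reach.
        if all(s in reach[t] for t in reach[s]):
            classes.add(frozenset(reach[s]))
    return classes

def is_unichain(edges):
    return len(recurrence_classes(edges)) == 1

chain = {"a": ["b"], "b": ["b", "c"], "c": ["b"]}
print(recurrence_classes(chain), is_unichain(chain))
```

In the example chain, state `a` is transient and `{b, c}` is the only recurrence class, so the chain is unichain.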



2.7 Quantitative Objectives 

A quantitative objective is given by a measurable function $f : \Omega \to \mathbb{R}$. We consider 
several objectives based on priority and reward functions. Given a priority function 
$p : S \to \{0, 1, \dots, d\}$, we define the set of plays satisfying the parity objective 
as $\Omega_p = \{\omega \in \Omega \mid \min(p(\mathrm{Inf}(\omega))) \text{ is even}\}$. A parity objective $\mathrm{parity}_p$ is the 
characteristic function of $\Omega_p$. Given a reward function $r : S \to \mathbb{N} \cup \{\bot\}$, the 
mean-payoff objective $\mathrm{mean}_r$ for a play $\omega = s_1 s_2 s_3 \dots$ is defined as 
$\mathrm{mean}_r(\omega) = \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} r(s_i)$ if for all $i > 0$: $r(s_i) \ne \bot$; otherwise $\mathrm{mean}_r(\omega) = \bot$. 
Given a priority function $p$ and a reward function $r$, the mean-payoff parity objective 
$\mathrm{mp}_{p,r}$ assigns the long-run average of the rewards if the parity objective is satisfied; 
otherwise it assigns $\bot$. Formally, for a play $\omega$ we have 

$$\mathrm{mp}_{p,r}(\omega) = \begin{cases} \mathrm{mean}_r(\omega) & \text{if } \mathrm{parity}_p(\omega) = 1, \\ \bot & \text{otherwise.} \end{cases}$$ 

For a reward function $r : S \to \mathbb{R}$ the max objective $\max_r$ assigns to a play the 
maximum reward that appears in the play. Note that since $S$ is finite, the number 
of different rewards appearing in a play is finite and hence the maximum is defined. 
Formally, for a play $\omega = s_1 s_2 s_3 \dots$ we have $\max_r(\omega) = \max\{r(s_i) \mid i \ge 1\}$. 
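
To make the mean-payoff parity objective concrete, consider a play that eventually repeats a finite cycle of states forever. The states visited infinitely often are then exactly the states on the cycle, and the long-run average is the average reward on the cycle. The Python sketch below evaluates the objective for such plays; it assumes that all rewards are natural numbers (no $\bot$ rewards), uses `None` to encode $\bot$, and the function name and dictionaries are hypothetical.

```python
from fractions import Fraction

def mean_payoff_parity(cycle, priority, reward):
    """Value of a play that eventually repeats `cycle` forever.

    `priority` and `reward` map states to naturals; None encodes bottom.
    """
    # Parity objective: the least priority seen infinitely often must be even.
    if min(priority[s] for s in cycle) % 2 != 0:
        return None
    # Mean-payoff objective: average reward along the repeated cycle.
    return Fraction(sum(reward[s] for s in cycle), len(cycle))

priority = {"s1": 1, "s2": 0, "s3": 1}
reward = {"s1": 4, "s2": 0, "s3": 2}
print(mean_payoff_parity(["s1", "s2"], priority, reward))  # cycle visits priority 0
print(mean_payoff_parity(["s1", "s3"], priority, reward))  # least priority is 1
```

The second call returns `None` although its average reward would be higher, which illustrates the tension between the two parts of the objective.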



2.8 Values, optimal strategies, and almost-sure winning states 

Given an MDP $G$, the value function $V_G(f)$ for an objective $f$ is the function from the 
state space $S$ to the set $\mathbb{R}$ of reals. For all states $s \in S$, let $V_G(f)(s) = \sup_{\pi \in \Pi} \mathbb{E}_s^\pi[f]$. 
In other words, the value $V_G(f)(s)$ is the maximal expectation with which player 1 
can achieve her objective $f$ from state $s$. A strategy $\pi$ is optimal from state $s$ for 
objective $f$ if $V_G(f)(s) = \mathbb{E}_s^\pi[f]$. For parity objectives, mean-payoff objectives, and 
max objectives pure memoryless optimal strategies exist in MDPs. 

Given an MDP $G$ and a priority function $p$, we denote by 
$W_G(\mathrm{parity}_p) = \{s \in S \mid V_G(\mathrm{parity}_p)(s) = 1\}$ the set of states with value 1. These states are called the 
almost-sure winning states for the player and an optimal strategy from the 
almost-sure winning states is called an almost-sure winning strategy. The set $W_G(\mathrm{parity}_p)$ 
for an MDP $G$ with priority function $p$ can be computed in $O(d \cdot n^{3/2})$ time, where 
$n$ is the size of the MDP $G$ and $d$ is the number of priorities [13, 14]. For states 
6 Note that we do not distinguish null or positive recurrent states since we only consider 
finite Markov chains. 

7 We use the usual definition for closed, i.e., given a set $Y$, a set $X \subseteq Y$ is closed under 
a relation $R \subseteq Y \times Y$ if for all $x \in X$ and for all $y \in Y$, if $(x, y) \in R$, then $y \in X$. 

Fig. 3. Automaton $A_1$    Fig. 4. System $M_1$    Fig. 5. System $M_2$ preferring $r_1$. 

in $S \setminus W_G(\mathrm{parity}_p)$ the parity objective is falsified with positive probability for all 
strategies, which implies that for all states in $S \setminus W_G(\mathrm{parity}_p)$ the value is less 
than 1 (i.e., $V_G(\mathrm{parity}_p)(s) < 1$). 

3 Measuring Systems 

In this section, we start with an example to explain the problem and introduce our 
measure. Then, we define the measure formally and finally show how to compute 
the value of a system with respect to the given measure. 

Example 3. Recall the example from the introduction, where we consider a 
resource controller for two clients. Client $i$ requests the resource by setting its 
request signal $r_i$. The resource is granted to Client $i$ by raising the grant signal $g_i$. 
We require that the controller guarantees mutually exclusive access and that it is 
fair, i.e., a requesting client eventually gets access to the resource. Assume we 
prefer controllers that respond quickly. Figure 3 shows a specification that rewards a 
quick response to request $r_1$. The specification is given as a mean-payoff automaton 
that measures the average delay between a request $r_1$ and a corresponding grant 
$g_1$. Recall that transitions are labeled with a conjunction of literals and a reward 
in parentheses. In particular, whenever a request is granted the reward is 1, while 
a delay of the grant results in reward 0. The automaton assigns to each word in 
$(2^{\{r_1, g_1\}})^\omega$ the average reward. For instance, the value of the word $(\{r_1\}\{r_1, g_1\})^\omega$ 
is $(0 + 1)/2 = 1/2$. We can take two copies of this specification, one for each client, 
and assign to each word in $(2^{\{r_1, r_2, g_1, g_2\}})^\omega$ the sum of the average rewards. E.g., 
the word $(\{r_1, g_2\}\{r_1, g_1\})^\omega$ gets an average reward of 1/2 with respect to the first 
client and reward 1 with respect to the second client, which sums up to a total 
reward of 3/2. 

Consider the systems $M_1$ and $M_2$ in Figures 4 and 5, respectively. Transitions 
are labeled with conjunctions of input and output literals separated by a slash. 
System $M_1$ alternates between granting the resource to Client 1 and 2. System $M_2$ 
grants the resource to Client 2 if only Client 2 is sending requests. By default 
it grants the resource to Client 1. If both clients request, then the controller 
alternates between them. Both systems are correct with respect to the functional 
requirements described above: they are fair to both clients and guarantee that the 
resource is not accessed simultaneously. 
Fig. 6. Product of $M_1$ with specifications $A_1$ and $A_2$. 



However, one can argue that System $M_2$ is better than $M_1$ because the delay 
between requests and grants is, for most input sequences, smaller than the delay 
in System $M_1$. For instance, consider the input trace $(\{r_2\}\{r_1\})^\omega$. The response 
of System $M_1$ is $(\{g_1\}\{g_2\})^\omega$. Looking at the product between the system $M_1$ and 
the specifications $A_1$ and $A_2$ shown in Figure 6, we can see that this results in 
an average reward of 1. Similarly, Figure 7 shows the product of $M_2$, $A_1$, and $A_2$. 
System $M_2$ responds with $(\{g_2\}\{g_1\})^\omega$ and obtains a reward of 2. Now, consider 
the sequence $(\{r_1, r_2\})^\omega$, which is the worst input sequence the environment can 
provide. In both systems, this sequence leads to a reward of 1, which is the lowest 
possible reward. So $M_1$ and $M_2$ cannot be distinguished with respect to their 
worst-case behavior. 

In order to measure a system with respect to its average behavior, we aim
to average over the rewards obtained for all possible input sequences. Since we
have infinite sequences, one way to average is the limit of the average over all
finite prefixes. Note that this can only be done if we know the values of finite
words with respect to the quantitative specification. For instance, for a finite-state
machine $M$ and a mean-payoff automaton $A$, we can define the average
as $V^{\oslash}_{A}(M) := \lim_{n\to\infty} \frac{1}{|\Sigma_I|^n} \sum_{w \in \Sigma_I^n} L_A(O_M(w^\omega))$. However, if we truly want to
capture the average behavior, we need to know how often the different parts of
the system are used. This corresponds to knowing how likely the different input
sequences are. The measure above assumes that all input sequences are "equally
likely". In order to define measures that take the behavior of the environment into
account, we use a probability measure on input words. In particular, we consider
the probability space $(\Sigma_I^\omega, \mathcal{F}, \mu)$ over $\Sigma_I^\omega$, where $\mathcal{F}$ is the $\sigma$-algebra generated
by the cylinder sets of $\Sigma_I^\omega$ (which are the sets of infinite words sharing a common
prefix; in other words, we have the Cantor topology on $\Sigma_I^\omega$) and $\mu$ is a probability
measure defined on $(\Sigma_I^\omega, \mathcal{F})$. We use finite labeled Markov chains to define the
probability measure $\mu$.

Example 4. Recall the controller of Example 3. Assume we know how the clients behave. We can represent
such a behavior by assigning probabilities to the events in $\Sigma_I = 2^{\{r_1,r_2\}}$. Assume
Client 1 sends requests with probability $p_1$ and Client 2 sends them with probability
$p_2 < p_1$, independent of what has happened before. Then, we can build
Fig. 7. Product of $M_2$, $A_1$, and $A_2$.

a labeled Markov chain with four states $S_p = \{s_0, s_1, s_2, s_3\}$, each labeled with a
letter in $\Sigma_I$, i.e., $\lambda(s_0) = \{\}$, $\lambda(s_1) = \{r_2\}$, $\lambda(s_2) = \{r_1\}$, and $\lambda(s_3) = \{r_1, r_2\}$,
and the following transition probabilities: (i) $\delta(s_i)(s_0) = (1 - p_1) \cdot (1 - p_2)$, (ii)
$\delta(s_i)(s_1) = (1 - p_1) \cdot p_2$, (iii) $\delta(s_i)(s_2) = p_1 \cdot (1 - p_2)$, and (iv) $\delta(s_i)(s_3) = p_1 \cdot p_2$,
for all $i \in \{0,1,2,3\}$.
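The chain of Example 4 is small enough to write down explicitly. The following sketch builds it as plain dictionaries; the function name `request_chain`, the integer state names, and the concrete probabilities in the usage line are illustrative assumptions.

```python
from fractions import Fraction

def request_chain(p1, p2):
    """Return (labels, delta) for the four-state labeled Markov chain of Example 4.

    State i stands for s_i; labels[i] is the input letter attached to s_i.
    """
    labels = {
        0: frozenset(),
        1: frozenset({"r2"}),
        2: frozenset({"r1"}),
        3: frozenset({"r1", "r2"}),
    }
    # The next letter is drawn independently of the current state,
    # so every state has the same outgoing distribution.
    row = {
        0: (1 - p1) * (1 - p2),
        1: (1 - p1) * p2,
        2: p1 * (1 - p2),
        3: p1 * p2,
    }
    delta = {s: dict(row) for s in labels}
    return labels, delta

# Hypothetical request probabilities with p2 < p1.
labels, delta = request_chain(Fraction(1, 2), Fraction(1, 4))
```

Each row of `delta` sums to 1, and because all rows are identical the chain is complete in the sense used for safety specifications.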

Once we have a probability measure $\mu$ on the input sequences and the associated
expectation measure $E$, we can define a satisfaction relation between systems
and specifications and a measure for a system with respect to a qualitative and a
quantitative specification.

Definition 1 (Satisfaction). Given a state machine $M$ with input alphabet $\Sigma_I$,
a qualitative specification $\varphi$, and a probability measure $\mu$ on $(\Sigma_I^\omega, \mathcal{F})$, we say that
$M$ satisfies $\varphi$ under $\mu$ (written $M \models_\mu \varphi$) iff $M$ satisfies $\varphi$ with probability 1, i.e.,
$E[\varphi \circ O_M] = 1$, where $E$ is the expectation measure for $\mu$.

Recall that we use a quantitative specification to describe how "good" a system 
is. Since we aim for a system that satisfies the given (qualitative) specification and 
is "good" in a given sense, we define the value of a machine with respect to a 
qualitative and a quantitative specification. 

Definition 2 (Value). Given a state machine $M$, a qualitative specification $\varphi$,
a quantitative specification $\psi$, and a probability measure $\mu$ on $(\Sigma_I^\omega, \mathcal{F})$, the value of
$M$ with respect to $\varphi$ and $\psi$ under $\mu$ is defined as the expectation of the function
$\psi \circ O_M$ under the probability measure $\mu$ if $M$ satisfies $\varphi$ under $\mu$, and $\bot$ otherwise.
Formally, let $E$ be the expectation measure for $\mu$, then

$$V^\mu_{\varphi,\psi}(M) := \begin{cases} E[\psi \circ O_M] & \text{if } M \models_\mu \varphi, \\ \bot & \text{otherwise.} \end{cases}$$

If $\varphi$ is the set of all words, then we write $V^\mu_\psi(M)$. Furthermore, we say $M$ optimizes
$\psi$ under $\mu$ if $V^\mu_\psi(M) \ge V^\mu_\psi(M')$ for all systems $M'$.

In Definition 2, we could also consider the traditional satisfaction relation, i.e.,
$M \models \varphi$. We have algorithms for both notions but we focus on satisfaction under
$\mu$, since satisfaction with probability 1 is the natural correctness criterion if we
are given a probabilistic environment assumption. Note that for safety specifications
the two notions coincide, because we assume that the labeled Markov chain
defining the input distribution is complete (see footnote 8). For parity specifications, the results
in this section would change only slightly if we replace $M \models_\mu \varphi$ by $M \models \varphi$. In
particular, instead of analyzing a Markov chain with parity objective, we would
have to analyze an automaton with parity objective. We discuss the alternative
synthesis algorithm in the conclusions.

8 Recall that a Markov chain is complete if in every state there is an edge for every
input value. Since every edge has a positive probability, every finite path also has a
positive probability, and therefore a system violating a safety specification will have
value $\bot$. If the Markov chain is not complete (i.e., we are given an input distribution
that assigns probability 0 to some finite input sequences), we require a simple
preprocessing step that restricts our algorithms to the set of states satisfying the safety
condition independent of the input assumption. This set can be computed in linear
time by solving a safety game.

Lemma 1. Given a finite-state machine $M$, a safety or a parity automaton $A$, a
mean-payoff automaton $B$, and a labeled Markov chain $(G, \lambda_G)$ defining a probability
measure $\mu$ on $(\Sigma_I^\omega, \mathcal{F})$, we can construct a Markov chain $G' = (S', s'_0, E', \delta')$,
a reward function $r'$, and a priority function $p'$ such that

$$V^\mu_{L_A,L_B}(M) = \begin{cases} 2 \cdot V_{G'}(\mathrm{mean}_{r'})(s'_0) & \text{if } A \text{ is a safety automaton,} \\ 2 \cdot V_{G'}(\mathrm{mp}_{p',r'})(s'_0) & \text{otherwise.} \end{cases}$$

Proof. To build $G'$, we first build the product of $M$, $A$, and $B$ (cf. Figure 6),
which is a finite-state machine $C = (Q, q_0, \Delta, \lambda)$ augmented with a (transition)
reward function $r : Q \times \Sigma_I \to \mathbb{N}$ and a priority function $p : Q \to \{0, \dots, d\}$. Let
$G = (S, s_0, E, \delta)$, then we construct a Markov chain $G' = (S', s'_0, E' \cup E'', \delta')$, a
reward function $r' : S' \to \mathbb{N}$, and a priority function $p' : S' \to \{0, \dots, d\}$ as follows:

$S' = Q \times S \times \{0,1\}$, $s'_0 = (q_0, s_0, 0)$, $E' = \{((q, s, 0), (q, s', 1)) \mid (s, s') \in E\}$,
$E'' = \{((q, s, 1), (q', s, 0)) \mid \Delta(q, \lambda_G(s)) = q'\}$, and
$\delta'((q, s, 0))((q, s', 1)) = \delta(s)(s')$, $\delta'((q, s, 1))((q', s, 0)) = 1$.

In $G'$ every transition of the product $C$ is split into two parts: in the first part, $G'$
chooses the input value according to the distribution given by $G$. In the second
part, $G'$ outputs the value from $M$ corresponding to the chosen input. The reward
given by $B$ for this transition is assigned to the intermediate state, i.e., $r'(s') = r(q, \lambda_G(s))$
if $s' = (q, s, 1)$, otherwise $r'(s') = 0$, and the priorities are copied
from $A$, i.e., $p'((q, s, b)) = p(q)$. If $A$ is a safety automaton, we overwrite the
reward function $r'$ to map all states $s' \in S'$ with priority 1 to $\bot$, i.e., $r'(s') = \bot$ if
$p'(s') = 1$. This allows us to ignore the priority function and compute the system
value based on the mean-payoff value.

Note that we can also compute $M \models_\mu L_A$ and $V^\mu_{L_B}(M)$ separately by building
two MCs: (1) $G'$ augmented with a priority function $p'$ and (2) $G''$ augmented with
a reward function $r''$. Then, $V^\mu_{L_A,L_B}(M) = V_{G''}(\mathrm{mean}_{r''})(s''_0)$ if $V_{G'}(\mathrm{parity}_{p'})(s'_0) = 1$,
otherwise $V^\mu_{L_A,L_B}(M) = \bot$. Even though the approach with two MCs has a
better complexity, we constructed a single MC to show the similarity between
verification and synthesis. □
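The split construction in the proof of Lemma 1 can be sketched directly. The code below is a minimal illustration under stated assumptions: the product machine $C$ is passed in as two hypothetical callables `step(q, letter)` and `reward(q, letter)`, priorities are omitted, and only reachable states of $G'$ are generated.

```python
from fractions import Fraction

def split_product(s0, delta, label, q0, step, reward):
    """Build the reachable part of G' over states (q, s, b).

    Returns (initial state, delta', r'). States with b == 0 draw the next
    chain state; states with b == 1 let the machine react and carry the reward.
    """
    init = (q0, s0, 0)
    delta_p, reward_p = {}, {}
    todo = [init]
    while todo:
        state = todo.pop()
        if state in delta_p:
            continue
        q, s, b = state
        if b == 0:
            # First half: choose the next input according to the chain G.
            succ = {(q, t, 1): pr for t, pr in delta[s].items() if pr > 0}
            reward_p[state] = 0
        else:
            # Second half: the machine moves on the letter of the chosen state.
            succ = {(step(q, label[s]), s, 0): 1}
            reward_p[state] = reward(q, label[s])
        delta_p[state] = succ
        todo.extend(succ)
    return init, delta_p, reward_p

# Toy instance (hypothetical): a two-state uniform input chain and a
# one-state machine that earns reward 1 whenever a request arrives.
half = Fraction(1, 2)
chain = {"idle": {"idle": half, "req": half}, "req": {"idle": half, "req": half}}
letters = {"idle": "", "req": "r"}
init, delta_p, reward_p = split_product(
    "idle", chain, letters, "q",
    step=lambda q, a: q,
    reward=lambda q, a: 1 if a == "r" else 0,
)
```

Because every original step becomes two steps in $G'$, the long-run average in $G'$ is half the original average, which is where the factor 2 in Lemma 1 comes from.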

Theorem 1. Given a finite-state machine $M$, a parity automaton $A$, a mean-payoff
automaton $B$, and a labeled Markov chain $(G, \lambda_G)$ defining a probability
measure $\mu$, we can compute the value $V^\mu_{L_A,L_B}(M)$ in polynomial time. Furthermore,
if $(G, \lambda_G)$ defines a uniform input distribution, then $V^{\oslash}_{B}(M) = V^\mu_{L_B}(M)$ (see footnote 9).

Proof. By Lemma 1, we construct a Markov chain $G'$, a reward function $r'$, and a
priority function $p'$ such that $V^\mu_{L_A,L_B}(M) = 2 \cdot V_{G'}(\mathrm{mp}_{p',r'})(s'_0)$. Since $G'$ is a Markov
chain, we can compute $W_{G'}(\mathrm{parity}_{p'})$ and $V_{G'}(\mathrm{mean}_{r'})(s'_0)$ in polynomial time [13,
30], and $V_{G'}(\mathrm{mp}_{p',r'})(s'_0) = V_{G'}(\mathrm{mean}_{r'})(s'_0)$ if $s'_0 \in W_{G'}(\mathrm{parity}_{p'})$, and $\bot$ otherwise. Note
that the value $V_{G'}(\mathrm{mean}_{r'})(s'_0)$ is the sum over all states $s$ of the reward at $s$ (i.e.,
$r'(s)$) times the long-run average frequency of being in $s$ (the Cesàro limit of being
at $s$ [30]). □
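The last remark of the proof, value equals reward times long-run frequency summed over states, can be illustrated numerically. The sketch below uses plain power iteration, which is an assumption made for brevity: it converges only for irreducible, aperiodic chains, whereas the Cesàro limit in the proof exists in general. The two-state chain and its rewards are hypothetical.

```python
def stationary_distribution(P, steps=10_000):
    """Approximate the long-run state frequencies of a row-stochastic matrix P."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(steps):
        # One step of the chain: dist_new[j] = sum_i dist[i] * P[i][j].
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist

def mean_payoff(P, rewards, steps=10_000):
    """Sum over states of reward times long-run frequency."""
    dist = stationary_distribution(P, steps)
    return sum(freq * r for freq, r in zip(dist, rewards))

# State 0 stays or moves with probability 1/2 each; state 1 always returns to 0.
P = [[0.5, 0.5],
     [1.0, 0.0]]
value = mean_payoff(P, rewards=[0.0, 3.0])
```

For this chain the frequencies are 2/3 and 1/3, so the mean-payoff value is 1.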

Example 5. Recall the two systems $M_1$ and $M_2$ (Figures 4 and 5, respectively) and
the specification $A$ (cf. Figure 3) that rewards quick responses. The two systems
are equivalent with respect to the worst-case behavior. Let us consider the average behavior:
we build a Markov chain $G$ that assigns probability 1/4 to all events in $2^{\{r_1,r_2\}}$. To measure
$M_1$, we take the product between $G$ and $M_1 \times A$ (shown in Figure 6). The
product looks like the automaton in Figure 6 with an intermediate state for each
edge. This state is labeled with the reward of the edge. All transitions leading to
intermediate states have probability 1/2, the other ones have probability 1. So
the expectation of being in a state is the same for all four main states (i.e., 1/8)
and half of it in the eight intermediate states (i.e., 1/16). Four (intermediate)
states have a reward of 1, four have a reward of 2. So we get a total reward of
$4 \cdot 1/16 + 4 \cdot 2 \cdot 1/16 = 3/4$, and a system value of 1.5. This is expected when
looking at Figure 6 because each state has two inputs resulting in a reward of 2
and two inputs with reward 1. For System $M_2$, we obtain a Markov chain similar to
Figure 7, but now the probabilities of the transitions corresponding to the self-loops
on the initial state sum up to 3/4. So it is more likely to stay in the initial state
than to leave it. The expectations for being in the states $(q_0,q_0,q_0)$, $(q_1,q_0,q_1)$,
and $(q_1,q_1,q_0)$ are 2/3, 2/9, and 1/9, respectively, and their expected rewards
are $(2 + 2 + 2 + 1)/4 = 7/4$, $3/2$, and $3/2$, respectively. So, the total reward of
System $M_2$ is $2/3 \cdot 7/4 + 2/9 \cdot 3/2 + 1/9 \cdot 3/2 = 5/3 \approx 1.67$, which is clearly better than
the value of system $M_1$ for specification $A$.
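The numbers in Example 5 can be rechecked with exact rational arithmetic. This snippet only replays the frequencies and rewards stated in the example; it does not rebuild the product chains, and the variable names are illustrative.

```python
from fractions import Fraction as F

# System M1: eight intermediate states, each visited with frequency 1/16;
# four of them carry reward 1 and four carry reward 2.
m1_mean = 4 * 1 * F(1, 16) + 4 * 2 * F(1, 16)
# Every step is split in two in the product, hence the factor 2.
m1_value = 2 * m1_mean

# System M2: frequencies and expected rewards of the three product states.
m2_freq = [F(2, 3), F(2, 9), F(1, 9)]
m2_reward = [F(7, 4), F(3, 2), F(3, 2)]
m2_value = sum(f * r for f, r in zip(m2_freq, m2_reward))
```

The exact value for $M_2$ is 5/3, which the text rounds to 1.67.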

4 Synthesizing Optimal Systems 

In this section, we show how to construct a system that satisfies a qualitative
specification and optimizes a quantitative specification under a given probabilistic

9 We can show that this measure is invariant under transformations of the computation 
tree. 
environment. First, we reduce the problem to finding an optimal strategy in an
MDP for a mean-payoff (parity) objective. Then, we show how to compute such
a strategy using end components and a reduction to max objective. In this part,
we also show how to decide if the given specification can be implemented by a
finite-state system that is optimal. In case the specification does not permit
such an implementation, we show how to construct, for every $\varepsilon > 0$, a finite-state
system that is $\varepsilon$-optimal. At the end of the section, we provide a linear program
that computes the value function of an MDP with max objective, which shows that
the value function of an MDP with mean-payoff parity objective can be computed
in polynomial time.

4.1 Reduction to MDP with mean-payoff (parity) objectives

Lemma 2. Given a safety (resp. parity) automaton $A$, a mean-payoff automaton
$B$, and a labeled Markov chain $(G, \lambda_G)$ defining a probability measure $\mu$ on
$(\Sigma_I^\omega, \mathcal{F})$, we can construct a labeled MDP $(G', \lambda_{G'})$ with $G' = (S', s'_0, E', S'_1, S'_P, \delta')$,
a reward function $r'$, and a priority function $p'$ such that every pure strategy $\pi$
that is optimal from state $s'_0$ for the objective $\mathrm{mean}_{r'}$ (resp. $\mathrm{mp}_{p',r'}$) and for which
$E^\pi_{s'_0}(\mathrm{mean}_{r'}) \neq \bot$ (resp. $E^\pi_{s'_0}(\mathrm{mp}_{p',r'}) \neq \bot$) corresponds to a state machine $M$ that
satisfies $L_A$ under $\mu$ and optimizes $L_B$ under $\mu$.

The construction of $G'$ is very similar to the construction used in Lemma 1.
Intuitively, $G'$ alternates between mimicking a move of $G$ and mimicking a move
of $A \times B \times C$, where $C$ is an automaton with $|\Sigma_O|$ states that pushes the output
labels from transitions to states, i.e., the transition function $\delta_C$ of $C$ is the largest
transition function s.t. $\forall s, s', \sigma, \sigma' : \delta_C(s, \sigma) = \delta_C(s', \sigma') \to \sigma = \sigma'$. Priorities $p'$
are again copied from $A$ and rewards $r'$ from $B$. The labels $\lambda_{G'}$ are either taken
from $\lambda_G$ (in intermediate states) or they correspond to the transitions taken in $C$.
Every pure strategy in $G'$ fixes one output value for every possible input sequence.
The construction of the state machine depends on the structure of the strategy.
For pure memoryless strategies, the construction is straightforward. At the end
of this section, we discuss how to deal with other strategies.

The following theorem follows from Lemma 2 and the fact that MDPs with 
mean-payoff objective have pure memoryless optimal strategies and they can be 
computed in polynomial time (cf. [30]). 

Theorem 2. Given a safety automaton $A$, a mean-payoff automaton $B$, and a
labeled Markov chain $(G, \lambda_G)$ defining a probability measure $\mu$, we can construct a
finite-state machine $M$ (if one exists) in polynomial time that satisfies $L_A$ under
$\mu$ (see footnote 10) and optimizes $L_B$ under $\mu$.

4.2 MDPs with mean-payoff parity objectives

It follows from Lemma 2 that if the qualitative specification is a parity automaton, 
along with the Markov chain for probabilistic input assumption, and mean-payoff 

10 Recall that for safety specifications $M \models_\mu L_A$ and $M \models L_A$ coincide.
automata for quantitative specification, then the solution reduces to solving MDPs
with mean-payoff parity objective. In the following we provide an algorithmic
solution of MDPs with mean-payoff parity objective. We first present a few basic
results on MDPs.

End components in MDPs [22, 19] play a role equivalent to closed recurrent sets
in Markov chains. Given an MDP $G = (S, s_0, E, S_1, S_P, \delta)$, a set $U \subseteq S$ of states is
an end component [22, 19] if $U$ is $\delta$-closed (i.e., for all $s \in U \cap S_P$ we have $E(s) \subseteq U$)
and the sub-game of $G$ restricted to $U$ (denoted $G \upharpoonright U$) is strongly connected. We
denote by $\mathcal{E}(G)$ the set of end components of an MDP $G$. The following lemma
states that, given any strategy (memoryless or not), with probability 1 the set of
states visited infinitely often along a play is an end component. This lemma allows
us to derive conclusions on the (infinite) set of plays in an MDP by analyzing the
(finite) set of end components in the MDP.
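The definition of an end component translates into a direct check. The sketch below assumes a hypothetical encoding in which `edges` maps every state to its set of successors and `prob_states` is the set $S_P$; it tests closure of probabilistic states and strong connectivity of the restriction to $U$.

```python
def is_end_component(U, edges, prob_states):
    """Check whether the state set U is an end component of the MDP graph."""
    U = set(U)
    if not U:
        return False
    # delta-closed: probabilistic states in U keep all successors inside U.
    if any(not edges[s] <= U for s in U & prob_states):
        return False
    # Every state needs a successor inside U (rules out dead-end singletons).
    if any(not (edges[s] & U) for s in U):
        return False

    def reach(start, succ):
        """States of U reachable from start via succ, staying inside U."""
        seen, todo = {start}, [start]
        while todo:
            for t in succ(todo.pop()):
                if t in U and t not in seen:
                    seen.add(t)
                    todo.append(t)
        return seen

    start = next(iter(U))
    forward = reach(start, lambda s: edges[s])
    backward = reach(start, lambda s: {u for u in U if s in edges[u]})
    # Strongly connected iff everything is reachable forwards and backwards.
    return forward == U and backward == U

# Toy MDP (hypothetical): player state a, probabilistic states p, q, z.
edges = {"a": {"p", "q"}, "p": {"a"}, "q": {"a", "z"}, "z": {"z"}}
prob_states = {"p", "q", "z"}
```

Here `{"a", "p"}` is an end component, while `{"a", "q"}` is not, because the probabilistic state `q` can leave the set towards `z`.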

Lemma 3. [22, 19] Given an MDP $G$, for all states $s \in S$ and all strategies
$\pi \in \Pi$, we have $\mu^\pi_s(\{\omega \mid \mathrm{Inf}(\omega) \in \mathcal{E}(G)\}) = 1$.

For an end component $U \in \mathcal{E}(G)$, consider the memoryless strategy $\pi_U$ that
plays in any state $s$ in $U \cap S_1$ all edges in $E(s) \cap U$ uniformly at random. Given
the strategy $\pi_U$, the end component $U$ is a closed connected recurrent set in the
Markov chain obtained by fixing $\pi_U$.

Lemma 4. Given an MDP $G$ and an end component $U \in \mathcal{E}(G)$, the strategy $\pi_U$
ensures that for all states $s \in U$, we have $\mu^{\pi_U}_s(\{\omega \mid \mathrm{Inf}(\omega) = U\}) = 1$.

It follows that the strategy $\pi_U$ ensures that from any starting state $s$, any other
state $t$ is reached in finite time with probability 1. From Lemma 4 we can conclude
that in an MDP the value for mean-payoff parity objectives can be obtained by
computing values for end components and then applying the maximal expectation
to reach the values of the end components.

Lemma 5. Consider an MDP $G$ with state space $S$, a priority function $p$, and
reward function $r$ such that (a) $G$ is an end component (i.e., $S$ is an end component)
and (b) the minimum priority in $S$ is even. Then the value for the mean-payoff
parity objective for all states coincides with the value for the mean-payoff objective, i.e.,
for all states $s$ we have $V_G(\mathrm{mp}_{p,r})(s) = V_G(\mathrm{mean}_r)(s)$.

Proof. We consider two pure memoryless strategies, one for the mean-payoff objective
and one for reaching the minimum priority, and combine them
to produce the value for the mean-payoff parity objective. Consider a pure memoryless
optimal strategy $\pi_m$ for the mean-payoff objective, and let $\pi_S$ be a
pure memoryless strategy for the stochastic shortest path to reach the states with
the minimum priority (which is even). Observe that under the strategy
$\pi_S$ we obtain a Markov chain such that every closed recurrent set in the Markov
chain contains states with the minimum priority, and hence from any state $s$ a
state with the minimum priority (which is even) is reached in finite time with
probability 1. The mean-payoff value for all states $s \in S$ is the same: if we fix the
memoryless strategy $\pi_U$ that chooses all successors uniformly at random, then we
get a Markov chain with the whole set $S$ as a closed recurrent set, and hence from
all states $s \in S$ any state $t \in S$ is reached in finite time with probability 1, and
hence the mean-payoff value at $s$ is at least the mean-payoff value at $t$. It follows
that for all $s, t \in S$ we have $V_G(\mathrm{mean}_r)(s) = V_G(\mathrm{mean}_r)(t)$, and let us denote the
uniform value by $v^*$. The strategy $\pi_m$ is a pure memoryless strategy and once it
is fixed we obtain a Markov chain and the limit of the average frequency of the
states exists, and since $\pi_m$ is optimal it follows that for all states $s \in S$ we have

$$\lim_{\ell \to \infty} \frac{1}{\ell} \cdot \sum_{i=0}^{\ell} E^{\pi_m}_s[r(\theta_i)] = v^*,$$

where $\theta_i$ is the random variable for the $i$-th state of a path. Hence the strategy
$\pi_m$ ensures that for any $\varepsilon > 0$, there exists $j(\varepsilon) \in \mathbb{N}$ such that if $\pi_m$ is played for
any $\ell \ge j(\varepsilon)$ steps then the expected average of the rewards for $\ell$ steps is within
$\varepsilon$ of the mean-payoff value of the MDP, i.e., for all $s \in S$, for all $\ell \ge j(\varepsilon)$ we have

$$\frac{1}{\ell} \cdot \sum_{i=0}^{\ell} E^{\pi_m}_s[r(\theta_i)] \ge v^* - \varepsilon.$$

Let $\beta$ be the maximum absolute value of the rewards. The optimal strategy for the
mean-payoff parity objective is played in rounds, and the strategy for round $i$ is as follows:

1. Stage 1. First play the strategy $\pi_S$ until a minimum priority state is reached.

2. Stage 2. Let $\varepsilon_i = 1/i$. If the game was in the first stage in this ($i$-th) round for $k_i$
steps, then play the strategy $\pi_m$ for $\ell_i$ steps such that $\ell_i \ge \max\{j(\varepsilon_i), i \cdot k_i \cdot \beta\}$.
This ensures that the expected average of the rewards in round $i$ is at least

$$v^* - \varepsilon_i - \frac{k_i \cdot \beta}{\ell_i} \ge v^* - \frac{2}{i}.$$

Then the strategy proceeds to round $i + 1$.
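The counter used by this strategy only needs the length of Stage 2 in each round. The sketch below computes that length under two assumptions that are not fixed by the text as extracted: the tolerance schedule is taken to be $\varepsilon_i = 1/i$, and `toy_j` is a hypothetical stand-in for the convergence bound $j(\varepsilon)$ of the Markov chain induced by $\pi_m$.

```python
import math
from fractions import Fraction

def stage2_length(i, k_i, beta, j):
    """Number of steps to play the mean-payoff strategy in round i.

    i    -- round number (1-based)
    k_i  -- steps spent in Stage 1 of this round
    beta -- maximum absolute reward
    j    -- function mapping a tolerance eps to a sufficient number of steps
    """
    eps_i = Fraction(1, i)  # assumed schedule; any sequence tending to 0 would do
    return max(j(eps_i), math.ceil(i * k_i * beta))

def toy_j(eps):
    """Hypothetical convergence bound, for illustration only."""
    return math.ceil(1 / eps)

length = stage2_length(i=3, k_i=2, beta=5, j=toy_j)
# Relative weight of the Stage 1 rewards within Stage 2 of the round.
overhead = Fraction(2 * 5, length)
```

Since `length` grows with both the round number and the time spent in Stage 1, the counter is unbounded in general, which is why this strategy needs infinite memory.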

The strategy ensures that there are infinitely many rounds, and hence the minimum
priority that is visited infinitely often with probability 1 is the minimum
priority of the end component (which is even). This ensures that the parity objective
is satisfied with probability 1. The above strategy ensures that the value for
the mean-payoff parity objective is at least

$$\liminf_{i \to \infty} \left(v^* - \frac{2}{i}\right) = v^*.$$

This completes the proof. □



Lemma 5 shows that in an end component, if the minimum priority is even,
then the values for the mean-payoff parity and mean-payoff objectives coincide if we
consider the sub-game restricted to the end component. The strategy constructed
in Lemma 5 requires infinite memory, and in the following lemma we show that for
all $\varepsilon > 0$, an $\varepsilon$-approximation can be achieved with finite-memory strategies.

Lemma 6. Consider an MDP $G$ with state space $S$, a priority function $p$, and
reward function $r$ such that (a) $G$ is an end component (i.e., $S$ is an end component)
and (b) the minimum priority in $S$ is even. Then for all $\varepsilon > 0$ there is a
finite-memory strategy $\pi_\varepsilon$ for which the mean-payoff parity objective value for all
states is within $\varepsilon$ of the value for the mean-payoff objective, i.e., for all states $s$
we have $E^{\pi_\varepsilon}_s[\mathrm{mp}_{p,r}] \ge V_G(\mathrm{mean}_r)(s) - \varepsilon$.

Proof. The proof is similar to the proof of Lemma 5; the key
difference is that the Stage 1 and Stage 2 strategies will be played for a fixed
number of steps, depending on $\varepsilon > 0$, that does not vary across rounds. Fix
$\varepsilon > 0$; we show how to construct a finite-memory strategy that achieves a $2 \cdot \varepsilon$
approximation. As $\varepsilon > 0$ is arbitrary, the desired result will follow. As in Lemma 5
we consider two pure memoryless strategies, one for the mean-payoff objective
and one for reaching the minimum priority, and combine them to produce
the approximation of the value for the mean-payoff parity objective. Consider a pure
memoryless optimal strategy $\pi_m$ for the mean-payoff objective, and let
$\pi_S$ be a pure memoryless strategy for the stochastic shortest path to reach the
states with the minimum priority (which is even). As in Lemma 5 we
observe that under the strategy $\pi_S$ we obtain a Markov chain such that every
closed recurrent set in the Markov chain contains states with the minimum priority,
and hence from any state $s$ a state with the minimum priority (which is even) is
reached in finite time with probability 1. Let $n$ be the number of states of the end
component, and let $\eta$ be the minimum positive transition probability in the end
component. The strategy $\pi_S$ ensures that from all states $s$ there is a path to a
minimum even priority state in the graph of the Markov chain, and the path is of
length at most $n$. Hence the strategy $\pi_S$ ensures that from all states $s$ a minimum
priority state is reached within $n$ steps with probability at least $\eta^n$ (we will refer
to this as Property 1 later in the proof). As shown in Lemma 5 the mean-payoff value
for all states $s \in S$ is the same: for all $s, t \in S$ we have $V_G(\mathrm{mean}_r)(s) = V_G(\mathrm{mean}_r)(t)$,
and let us denote the uniform value by $v^*$. The strategy $\pi_m$ is a pure memoryless
strategy and once it is fixed we obtain a Markov chain and the limit of the average
frequency of the states exists, and since $\pi_m$ is optimal it follows that for all states
$s \in S$ we have

$$\lim_{\ell \to \infty} \frac{1}{\ell} \cdot \sum_{i=0}^{\ell} E^{\pi_m}_s[r(\theta_i)] = v^*,$$

where $\theta_i$ is the random variable for the $i$-th state of a path. Hence the strategy $\pi_m$
ensures that for any $\varepsilon_1 > 0$, there exists $j(\varepsilon_1) \in \mathbb{N}$ such that if $\pi_m$ is played for
any $\ell \ge j(\varepsilon_1)$ steps then the expected average of the rewards for $\ell$ steps is within
$\varepsilon_1$ of the mean-payoff value of the MDP, i.e., for all $s \in S$, for all $\ell \ge j(\varepsilon_1)$ we
have

$$\frac{1}{\ell} \cdot \sum_{i=0}^{\ell} E^{\pi_m}_s[r(\theta_i)] \ge v^* - \varepsilon_1.$$

Let $\beta$ be the maximum absolute value of the rewards. The finite-memory $2 \cdot \varepsilon$-optimal
strategy for the mean-payoff parity objective is played in rounds, but in
contrast to Lemma 5 in every round the same strategy is played. The strategy for
a round is as follows:

1. Stage 1. First play the strategy $\pi_S$ for $n$ steps.

2. Stage 2. Play the strategy $\pi_m$ for $\ell$ steps such that $\ell \ge \max\{j(\varepsilon), \frac{1}{\varepsilon} \cdot n \cdot \beta\}$.
This ensures that the expected average of the rewards in a round is at least
$v^* - 2 \cdot \varepsilon$.

Then the strategy proceeds to the next round.

The above strategy is a finite-memory strategy as it needs to remember the number
$n$ for the first stage and the number $\ell$ for the second stage. The above strategy ensures
that the value for the mean-payoff objective is at least $v^* - 2 \cdot \varepsilon$. To complete the
proof that the strategy is a $2 \cdot \varepsilon$-optimal strategy we need to show that the parity
objective is satisfied with probability 1. We call a round a success if a minimum
even priority state is visited. Hence we need to argue that with probability 1 there
are infinitely many success rounds. Every round is a success with probability at
least $\alpha = \eta^n > 0$ (as by Property 1 the strategy $\pi_S$ ensures that a minimum
priority state is visited with probability at least $\alpha$ in $n$ steps). For round $i$, the
probability that there is no success round after round $i$ is at most $\lim_{k \to \infty} (1 - \alpha)^k = 0$. Since
the countable union of measure zero events has measure zero, it follows that
with probability 1 every round is followed by a later success round. It
follows that the probability that there are infinitely many success rounds is 1, i.e.,
the parity objective is satisfied with probability 1. This completes the proof. □
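The parameters of the finite-memory strategy are fixed once and reused in every round. The sketch below computes them from $n$, $\eta$, $\beta$, and $\varepsilon$; the function names and the constant bound passed for $j$ in the usage line are hypothetical, and exact rationals are used to avoid rounding effects.

```python
import math
from fractions import Fraction

def round_parameters(n, eta, beta, eps, j):
    """Return (Stage 1 length, Stage 2 length, per-round success probability)."""
    stage1 = n
    stage2 = max(j(eps), math.ceil(n * beta / eps))
    alpha = eta ** n  # lower bound on the probability of a success round
    return stage1, stage2, alpha

def no_success_bound(alpha, k):
    """Upper bound on the probability that k consecutive rounds all fail."""
    return (1 - alpha) ** k

stage1, stage2, alpha = round_parameters(
    n=3, eta=Fraction(1, 2), beta=4, eps=Fraction(1, 10), j=lambda e: 50
)
```

Both stage lengths are constants, so a bounded counter suffices, and `no_success_bound` tends to 0 as the number of rounds grows, which is the core of the almost-sure parity argument.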

In the following we show that if a system can achieve the optimal value with 
a pure finite-state strategy, then it can achieve the optimal value also with a pure 
memoryless strategy. 
Lemma 7. Consider an MDP $G = (S, s_0, E, S_1, S_P, \delta)$, a priority function $p$,
and reward function $r$ such that (a) $S$ is an end component and (b) the minimum
priority in $S$ is even. If there exists an optimal pure finite-state strategy $\pi$, then
there exists an optimal pure memoryless strategy $\pi'$.

Proof. Let $M$ be the Markov chain obtained by fixing the strategy in $G$ to $\pi$, i.e.,
$M$ is the synchronous product of $G$ and a finite-state system describing $\pi$. Since
the mean-payoff parity objective is prefix-independent and $S$ is an end component
(i.e., all states can reach each other with probability 1), all recurrence classes in $M$
have the same mean-payoff parity value. Therefore we can construct a finite-state
strategy $\bar\pi$ such that the Markov chain $\bar M$ obtained by fixing the strategy in $G$
to $\bar\pi$ has a single recurrence class. Let $C$ be the single recurrence class of $\bar M$ and
let $C|_G$ be the set of states in $G$ that appear in $C$. We know that $\min(p(C|_G))$ is
even. Let $C_1, \dots, C_k$ be the component recurrence classes that arise if we fix an
optimal pure memoryless strategy for the mean-payoff objective in $G$ restricted
to $C|_G$. Since $\bar\pi$ is an optimal strategy, $C$ and its component recurrence classes
$C_1, \dots, C_k$ have the same mean-payoff value. Otherwise, assume there exists some
$C_i$ that has a higher value, then an infinite-state strategy that alternates between
playing a strategy that ensures $C_i$ and a strategy to reach the minimal priority
(cf. proof of Lemma 5) would achieve a higher mean-payoff parity value, which
contradicts the assumption that $\bar\pi$ is an optimal strategy. Similarly, if some $C_i$ has
a lower value, then removing $C_i$ would again result in a better strategy. If there
is a recurrence class $C_i$ such that $\min(p(C_i))$ is odd, then we can ignore $C_i$ in $C$
without changing the value. Finally, assume there are two component recurrence
classes $C_1$ and $C_2$ such that $\min(p(C_1))$ and $\min(p(C_2))$ are even, then we can ignore
one of them without changing the payoff value. From these properties, it follows
that if there exists an optimal finite-state strategy $\pi$, then there exists a recurrence
class $C_i$ s.t. the minimal priority is even and the mean-payoff value is the same as
the mean-payoff value of $\pi$. The desired pure memoryless strategy $\pi'$ enforces the
recurrence class $C_i$ by playing a strategy to stay within $C_i$ for states in $C_i$, and for
all states outside of $C_i$ it plays a pure memoryless almost-sure winning strategy
to reach $C_i$.

4.3 Algorithm based on linear programming 

Computing best end-component values. We first compute a set $S^*$ such that
every end component $U$ for which $\min(p(U))$ is even is a subset of $S^*$. We also compute
a function $f^* : S^* \to \mathbb{R}^+$ that assigns to every state $s \in S^*$ the value for the mean-payoff
parity objective that can be obtained by visiting only states of an end
component that contains $s$. The computation of $S^*$ and $f^*$ is as follows:

1. $S^*_0$ is the set of maximal end components with priority 0, and for a state
$s \in S^*_0$ the function $f^*$ assigns the mean-payoff value when the sub-game
is restricted to $S^*_0$ (by Lemma 5 we know that if we restrict the game to the
end components, then the mean-payoff values and mean-payoff parity values
coincide);

2. for $i > 0$, let $S^*_i$ be the set of maximal end components with states with
priority $2i$ or more and that contain at least one state with priority $2i$, and
$f^*$ assigns the mean-payoff value of the MDP restricted to the set of end
components $S^*_i$.

The set $S^* = \bigcup_{i=0}^{\lfloor d/2 \rfloor} S^*_i$. This procedure gives the values under the end-component
consideration. In the following, we show how to check if an end component has a
pure memoryless strategy that achieves the optimal value.

Checking end-component for memoryless strategy. Let $U \in S^*$ be a maximal
end component with a minimal even priority, as computed in the previous
section. Without loss of generality we assume that the MDP is bipartite, i.e.,
player-1 states and probabilistic states strictly alternate along every path. Let
$E_1 = E \cap (S_1 \times S_P)$ be the set of player-1 edges, i.e., the set of edges starting from
a player-1 state. The mean-payoff value of an end component can be computed
using the following linear program for MDPs with unichain strategies (cf. [42, 22]):

maximize $\sum_{(s,t) \in E_1} x_{(s,t)} \cdot (r(s) + r(t))$ (1)

subject to $\sum_{(s,t) \in E_1} x_{(s,t)} = 1$ (2)

$\forall s \in S_1 : \sum_{t \in S_P, (s,t) \in E} x_{(s,t)} = \sum_{(s',t') \in E_1} x_{(s',t')} \cdot \delta(t')(s)$ (3)
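To make the constraints concrete, the following sketch checks a candidate frequency vector against Eqn. 2 and Eqn. 3 and evaluates the objective of Eqn. 1. It does not solve the linear program; the dictionary encoding, the function names, and the toy MDP are assumptions made for illustration, and non-negativity of the variables is checked as the usual implicit constraint.

```python
from fractions import Fraction

def check_frequencies(x, delta, player_states):
    """x maps player-1 edges (s, t) to frequencies; delta[t][s] is a probability."""
    nonneg = all(v >= 0 for v in x.values())
    normalized = sum(x.values()) == 1                          # Eqn. 2
    balanced = all(                                            # Eqn. 3
        sum(v for (s, _t), v in x.items() if s == s1)
        == sum(v * delta[t].get(s1, 0) for (_s, t), v in x.items())
        for s1 in player_states
    )
    return nonneg and normalized and balanced

def objective(x, r):
    """Value of Eqn. 1 for the frequency vector x and state rewards r."""
    return sum(v * (r[s] + r[t]) for (s, t), v in x.items())

# Toy bipartite MDP (hypothetical): player states a, b; probabilistic p, q, u.
delta = {
    "p": {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    "q": {"a": Fraction(1)},
    "u": {"a": Fraction(1)},
}
r = {"a": 0, "b": 0, "p": 1, "q": 2, "u": 7}
# Frequencies induced by always choosing a -> p and b -> u.
x = {("a", "p"): Fraction(2, 3), ("a", "q"): Fraction(0), ("b", "u"): Fraction(1, 3)}
```

For this candidate the balance equations hold and the objective evaluates to 3, which beats the alternative of always choosing the edge to `q` (value 2).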

The program has one variable $x_{(s,t)}$ for every outgoing edge of a player-1 state.
Intuitively, $x_{(s,t)}$ represents the frequency of being in state $s$ and choosing the edge
to state $t$. Note that all states $s, t$ such that $x_{(s,t)} > 0$ belong to a recurrence class.
In order to check if there exists an optimal pure memoryless strategy in $U$, we
call a modified version of the linear program above for every even priority $d$. In
particular, we add the following additional constraints:

$\forall s \in S_1\ \forall t \in S_P, (s,t) \in E : x_{(s,t)} = 0$ if $p(s) < d$ or $p(t) < d$ (4)

It requires that in the resulting recurrence class no priority smaller than $d$ is visited.
To ensure that the resulting recurrence class includes at least one state with
priority $d$, we add the following term to the objective function (Eqn. 1).

$\sum_{(s,t) \in E_1 \text{ s.t. } p(s) = d \text{ or } p(t) = d} x_{(s,t)}$ (5)

Finally, let $v$ be the mean-payoff value for $U$ obtained by solving the linear program
with Eqn. 1 to 3. If there exists an even priority $d$ such that the modified linear
program (Eqn. 1 to 5) has a value strictly greater than $v$, then there exists a pure
memoryless strategy in $S$ that achieves the optimal value. If the value of the linear
program is strictly greater than $v$, then there exists a witness priority $d$ and a
corresponding edge $(s,t) \in E_1$ such that $x_{(s,t)}$ in Eqn. 5 has a positive value.

In order to compute the maximal reachability expectation we present the fol- 
lowing reduction. 

Transformation to MDPs with max objective. Given an MDP $G = (S, s_0, E, S_1, S_P, \delta)$
with a positive reward function $r : S \to \mathbb{R}^+$ and a priority function $p : S \to
\{0, \dots, d\}$, let $S^*$ and $f^*$ be the output of the above procedure. We construct
an MDP $\bar G = (\bar S, s_0, \bar E, \bar S_1, S_P, \bar\delta)$ with a reward function $\bar r$ as follows:
$\bar S = S \cup \hat S^*$ (i.e., the set of states consists of the state space $S$ and a copy $\hat S^*$
of $S^*$), $\bar E = E \cup \{(s, \hat s) \mid s \in S^* \cap S_1 \text{ and } \hat s \text{ is the copy of } s \text{ in } \hat S^*\} \cup \{(\hat s, \hat s) \mid \hat s \in \hat S^*\}$
(i.e., along with edges $E$, for all player-1 states $s$ in $S^*$ there is an edge to its copy
$\hat s$ in $\hat S^*$, and all states in $\hat S^*$ are absorbing states), $\bar S_1 = S_1 \cup \hat S^*$, $\bar r(s) = 0$ for all
$s \in S$ and $\bar r(\hat s) = f^*(s)$, where $\hat s$ is the copy of $s$. We refer to this construction
as max conversion. The relationship between $V_G(\mathrm{mp}_{p,r})$ and $V_{\bar G}(\max_{\bar r})$ can be
established as follows.

1. Consider a strategy $\pi$ in $G$. If an end component $U$ is visited infinitely often,
and $\min(p(U))$ is odd, then the payoff is $\bot$, and if $\min(p(U))$ is even, then
the maximal payoff achievable for the mean-payoff parity objective is upper
bounded by the payoff of the mean-payoff objective (which is assigned by $f^*$).
It follows that for all $s \in S$ we have $V_G(\mathrm{mp}_{p,r})(s) \le V_{\bar G}(\max_{\bar r})(s)$.

2. Let $\bar\pi$ be a pure memoryless optimal strategy for the objective $\max_{\bar r}$ in $\bar G$. We
fix a strategy $\pi$ in $G$ as follows: if at a state $s \in S^*$ the strategy $\bar\pi$ chooses
the edge $(s, \hat s)$, then in $G$ on reaching $s$, the strategy $\pi$ plays according to the
strategy of a winning end component that ensures the mean-payoff value (as
shown in Lemma 5), otherwise $\pi$ follows $\bar\pi$. It follows that for all $s \in S$ we
have $V_G(\mathrm{mp}_{p,r})(s) \ge V_{\bar G}(\max_{\bar r})(s)$.

It follows that for all $s \in S$ we have $V_G(\mathrm{mp}_{p,r})(s) = V_{\bar G}(\max_{\bar r})(s)$. In order to
solve $\bar G$ with the objective $\max_{\bar r}$, we set up the following linear program and solve
it with a standard LP solver (e.g., [33]).
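The max conversion is easy to prototype. The sketch below builds the converted MDP and approximates its value function by value iteration from below, which is a stand-in chosen here for brevity and not the linear program of the paper. The encoding (`edges` as nested dictionaries, copies as `("copy", s)` tuples) and the toy MDP are assumptions.

```python
from fractions import Fraction

def max_conversion(edges, player_states, f_star):
    """Add an absorbing copy of every state in S*, rewarded with f*(s)."""
    new_edges = {s: dict(succ) for s, succ in edges.items()}
    reward = {s: 0 for s in edges}
    copies = set()
    for s, value in f_star.items():
        copy = ("copy", s)
        copies.add(copy)
        new_edges[copy] = {copy: 1}      # copies are absorbing
        reward[copy] = value
        if s in player_states:           # only player-1 states may stop
            new_edges[s][copy] = 1
    return new_edges, set(player_states) | copies, reward, copies

def max_values(edges, players, reward, copies, sweeps=50):
    """Value iteration from below for the expected reward on absorption."""
    value = {s: (reward[s] if s in copies else 0) for s in edges}
    for _ in range(sweeps):
        new = {}
        for s, succ in edges.items():
            if s in copies:
                new[s] = reward[s]
            elif s in players:
                new[s] = max(value[t] for t in succ)
            else:
                new[s] = sum(pr * value[t] for t, pr in succ.items())
        value = new
    return value

# Toy MDP (hypothetical): a moves to the coin flip p, which lands in b or c.
half = Fraction(1, 2)
edges = {"a": {"p": 1}, "p": {"b": half, "c": half}, "b": {"b": 1}, "c": {"c": 1}}
g_edges, g_players, g_reward, g_copies = max_conversion(
    edges, {"a", "b", "c"}, {"a": 3, "b": 10, "c": 2}
)
values = max_values(g_edges, g_players, g_reward, g_copies)
```

In this instance stopping at `a` yields 3, whereas moving on yields the average of 10 and 2, so the value at `a` is 6.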

Linear programming for the max objective in $\bar G$. The following linear program
characterizes the value function $V_{\bar G}(\max_{\bar r})$. Observe that we have already
restricted ourselves to the almost-sure winning states $W_G(\mathrm{parity}_p)$, and below we
assume $W_G(\mathrm{parity}_p) = S$. For all $s \in \bar S$ we have a variable $x_s$ and the objective
function is $\min \sum_{s \in \bar S} x_s$. The set of linear constraints is as follows:

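$x_s \ge 0$ $\quad \forall s \in \bar S$;

$x_s = \bar r(s)$ $\quad \forall s \in \hat S^*$;

$x_s \ge x_t$ $\quad \forall s \in S_1, (s,t) \in \bar E$;

$x_s = \sum_{t \in \bar S} \bar\delta(s)(t) \cdot x_t$ $\quad \forall s \in S_P$.

The correctness proof of the above linear program to characterize the value function
$V_{\bar G}(\max_{\bar r})$ follows by extending the result for reachability objectives [30]. The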
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key property that can be used to prove the correctness of the above claim is as 
follows: if a pure memoryless optimal strategy is fixed, then from all states in $S$,
the set $\widehat{S}$ of absorbing states is reached with probability 1. The above property
can be proved as follows: since $r$ is a positive reward function, it follows that
for all $s \in S$ we have $V_G(\mathrm{mp}^p_r)(s) > 0$. Moreover, for all states $s \in S$ we have
$V_{\bar{G}}(\max_{\bar{r}})(s) = V_G(\mathrm{mp}^p_r)(s) > 0$. Observe that for all $s \in S$ we have $\bar{r}(s) = 0$.
Hence if we fix a pure memoryless optimal strategy $\bar{\pi}$ in $\bar{G}$, then in the Markov
chain $\bar{G}_{\bar{\pi}}$ there is no closed recurrent set $C$ such that $C \subseteq S$. It follows that for all
states $s \in S$, in the Markov chain $\bar{G}_{\bar{\pi}}$, the set $\widehat{S}$ is reached with probability 1. Using the
above fact and the correctness of linear programming for reachability objectives,
the correctness proof of the above linear program for the objective $\max_{\bar{r}}$ in $\bar{G}$ can
be obtained. This shows that the value function $V_G(\mathrm{mp}^p_r)$ for MDPs with reward
function $r$ can be computed in polynomial time.

We can search for a pure memoryless strategy that achieves the optimal value by
slightly modifying the presented procedure. First, we check for each end component
whether a pure memoryless strategy with optimal value exists. Then, in the
transformation to an MDP with max objective, we create copy states only for states
in end components that have optimal pure memoryless strategies. In all states for
which the values obtained from the two different transformations to an MDP with
max objective coincide, a pure memoryless strategy that achieves the optimal value
exists. This gives us the following lemma.

Lemma 8. Given an MDP with a mean-payoff parity objective, the value function
for the mean-payoff parity objective can be computed in polynomial time. We can
decide in polynomial time if there exists a pure memoryless (or finite-state) strategy
that achieves the optimal value.
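
As an illustration of the reduction behind Lemma 8, the following Python sketch builds the max conversion for a small explicit MDP and computes the value function of the max objective. It is only a sketch under stated assumptions: the dictionary encoding of the MDP, the names `max_conversion` and `max_values`, the `^` suffix for copy states, and the toy numbers are all hypothetical. Instead of calling an LP solver, it iterates the constraints from below, which approaches the least solution of the linear program on this example.

```python
from typing import Dict, List, Set, Tuple

State = str


def max_conversion(
    choices: Dict[State, List[State]],
    f_star: Dict[State, float],
) -> Tuple[Dict[State, List[State]], Dict[State, float], Set[State]]:
    """Add an absorbing copy s^ for every player-1 state s that has an f* value."""
    new_choices = {s: list(ts) for s, ts in choices.items()}
    reward: Dict[State, float] = {}  # rewards of copy states; original states have reward 0
    copies: Set[State] = set()
    for s in choices:
        if s in f_star:
            copy = s + "^"
            new_choices[s].append(copy)  # edge (s, s^)
            new_choices[copy] = [copy]   # s^ is absorbing
            reward[copy] = f_star[s]
            copies.add(copy)
    return new_choices, reward, copies


def max_values(choices, probs, reward, copies, rounds=100):
    """Iterate x_s = max_t x_t (player 1) and x_s = sum_t delta(s)(t) * x_t from x = 0."""
    x = {s: 0.0 for s in list(choices) + list(probs)}
    for c in copies:
        x[c] = reward[c]  # x_s = r(s) on absorbing copies
    for _ in range(rounds):
        for s, succs in choices.items():
            if s not in copies:
                x[s] = max(x[t] for t in succs)
        for s, dist in probs.items():
            x[s] = sum(p * x[t] for t, p in dist.items())
    return x


# Toy MDP (assumed): player-1 states a, b, c and one probabilistic state p.
choices = {"a": ["p"], "b": ["b"], "c": ["c"]}
probs = {"p": {"b": 0.5, "c": 0.5}}
f_star = {"b": 2.0, "c": 4.0}  # values assigned to winning end components

new_choices, reward, copies = max_conversion(choices, f_star)
values = max_values(new_choices, probs, reward, copies)
print(values["a"])
```

In this toy instance every state reaches an absorbing copy with probability 1, which is the key property used in the correctness argument above.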

Note that, in general, the optimal strategies constructed for mean-payoff parity
objectives require memory, but the memory requirement is captured by a counter
(which can be represented by a state machine with state space $\mathbb{N}$). The optimal
strategy as described in Lemma 5 plays two memoryless strategies, and each strategy
is played for a number of steps that can be stored in a counter. Using Lemma 6, we
can fix the size of the counter for any $\epsilon > 0$ and obtain a finite-state strategy that is
$\epsilon$-optimal. Lemma 7 and the procedure above allow us to check in polynomial time
if there exists a pure memoryless strategy that achieves the optimal value. This
result is quite surprising because the related problem of computing the optimal
pure memoryless strategy, i.e., the strategy that is optimal with respect to all pure
memoryless strategies, is NP-complete: the upper bound follows from Theorem 1 and
the fact that emptiness of parity automata can be checked in polynomial time [36];
the lower bound follows from a reduction of the directed subgraph homeomorphism
problem [31].

Lemma 2 and Lemma 8 yield the following theorem. 

Theorem 3. Given a Parity specification $A$, a Mean-payoff specification $B$, and
a labeled Markov chain $(G, \lambda)$ defining a probability measure $\mu$ on $(\Sigma_I^\omega, \mathcal{F})$, we can
construct a state machine $M$ (if one exists) in polynomial time that satisfies $L_A$
under $\mu$ and optimizes $L_B$ under $\mu$. We can decide in polynomial time if $M$ can
be implemented by a finite-state machine. If $M$ requires infinite memory, then for
all $\epsilon > 0$, we can construct a finite-state machine $M'$ that satisfies $L_A$ under $\mu$
and optimizes $L_B$ under $\mu$ within $\epsilon$.

5 Experimental Results 

In this section we illustrate which types of systems we can construct using
qualitative and quantitative specifications under probabilistic environment assumptions.
We have implemented the approach as part of Quasy, our quantitative synthesis
tool [15]. Our tool takes qualitative and quantitative specifications and
automatically constructs a system that satisfies the qualitative specification and optimizes
the quantitative specification, if such a system exists. The user can choose between
a system that satisfies and optimizes the specifications (a) under all possible
environment behaviors or (b) under the most likely environment behaviors, given as
a probability distribution on the possible input sequences.

We are interested in the latter functionality, i.e., in systems that are optimal for
the average-case behaviors of the environment. In this case, the specification consists
of (i) a safety or a parity automaton $A$, (ii) a mean-payoff automaton $B$, and (iii) an
environment assumption $\mu$, given as a set of probability distributions $d_s$ over input
letters, one for each state $s$ of $B$. Our implementation first builds the product of $A$
and $B$. Then, it constructs the corresponding MDP $G$. If $A$ is a safety specification,
our implementation computes an optimal pure memoryless strategy using policy
iteration for multi-chain MDPs [30]. Finally, if the value of the strategy is different
from $\bot$, it converts the strategy to a finite-state machine $M$ that satisfies
$L_A$ (under $\mu$) and is optimal for $B$ under $\mu$. In the case of parity specifications,
we implemented the algorithm described in Section 4.2. Then, our implementation
produces two Mealy machines $M_1$ and $M_2$ as output: (i) $M_1$ is optimal with respect
to the mean-payoff objective and (ii) $M_2$ almost-surely satisfies the parity objective. The
actual system is a combination of the two Mealy machines: driven by the
inputs from the environment, it switches from one Mealy machine to the other
based on a counter, as explained in Section 4.2. More precisely, if we use the
strategy used in the proof of Lemma 5, we obtain an optimal but infinite-state
system, because the size of the counter cannot be bounded. If we aim for a
finite-state system, we can use the strategy suggested in the proof of Lemma 6, leading to
a finite-state system with an $\epsilon$-optimal value. Furthermore, Lemma 7 and the
corresponding linear program in Section 4.3 allow us to check if there exists
an optimal pure finite-state strategy. In this case, we can return a single Mealy
machine.
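
The counter-based combination of the two machines can be sketched as follows in Python. This is a minimal illustration under assumptions, not the schedule from the proofs of Lemma 5 and Lemma 6: the class name `CounterSwitch`, the two-state toy environment, and the rule "lengthen the mean-payoff phase by one after each visit to a target state" are hypothetical. Passing `bound` caps the counter, which gives a finite-state variant in the spirit of the $\epsilon$-optimal construction.

```python
from typing import Callable, Dict, Hashable, Optional


class CounterSwitch:
    """Alternate between a mean-payoff strategy and a parity strategy using a counter."""

    def __init__(
        self,
        mp_move: Dict[Hashable, str],
        parity_move: Dict[Hashable, str],
        is_target: Callable[[Hashable], bool],
        bound: Optional[int] = None,
    ) -> None:
        self.mp_move = mp_move          # memoryless strategy of M_1 (mean-payoff)
        self.parity_move = parity_move  # memoryless strategy of M_2 (parity)
        self.is_target = is_target      # states that complete a parity phase
        self.bound = bound              # None means an unbounded counter
        self.phase_len = 1              # length of the current mean-payoff phase
        self.steps_left = 1             # mean-payoff steps remaining in this phase

    def next_move(self, state: Hashable) -> str:
        if self.steps_left > 0:
            self.steps_left -= 1
            return self.mp_move[state]
        if self.is_target(state):
            # Parity phase finished: start a longer mean-payoff phase.
            self.phase_len += 1
            if self.bound is not None:
                self.phase_len = min(self.phase_len, self.bound)
            self.steps_left = self.phase_len - 1  # this step already counts as one
            return self.mp_move[state]
        return self.parity_move[state]


def run(controller: CounterSwitch, steps: int) -> int:
    """Toy environment (assumed): 'go' toggles between states 0 and 1, 'stay' does not."""
    state = 0
    for _ in range(steps):
        if controller.next_move(state) == "go":
            state = 1 - state
    return controller.phase_len


mp_move = {0: "stay", 1: "go"}    # prefers to stay in state 0
parity_move = {0: "go", 1: "go"}  # heads for the target state 1
unbounded = run(CounterSwitch(mp_move, parity_move, lambda s: s == 1), 10)
bounded = run(CounterSwitch(mp_move, parity_move, lambda s: s == 1, bound=2), 10)
print(unbounded, bounded)
```

With an unbounded counter the mean-payoff phases keep growing, so the controller is not finite-state; with `bound=2` the phase length stays fixed and the controller has finitely many configurations.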

5.1 Priority-driven Controller. 

In our first experiment, we took as the quantitative specification $B$ the product of
the specifications $A_1$ and $A_2$ from Example 3 (Figure 3), where we sum the weights
on the edges. The qualitative specification is a safety automaton $A$ ensuring
mutually exclusive grants. We assumed the constant probabilities $P(\{r_1 = 1\}) = 0.4$
and $P(\{r_2 = 1\}) = 0.3$ for the events $r_1 = 1$ and $r_2 = 1$, respectively. The
optimal machine constructed by the tool is shown in Figure 8. Note that its behavior
does not depend on the state, i.e., states $q_0$ and $q_1$ are simulation equivalent and
can be merged. Since our tool does not minimize state machines yet, we obtain
a system with two states. This system behaves like a priority-driven scheduler. It
always grants the resource to the client that is more likely to send requests, if she
is requesting it. Otherwise, the resource is granted to the other client. Intuitively,
this is optimal because Client 1 is more likely to send requests and so missing a
request from Client 2 is better than missing a request from Client 1.

Fig. 8. Optimal Mealy machine for the 2-client specification without response constraints
and the safety automaton G2

Table 1. Results for 2 to 7 clients without response constraints

Clients   States in A x B   States in G   States in M   Value of M   Time in s
2         4                 13            2             1.854        0.50
3         8                 35            4             2.368        0.81
4         16                97            8             2.520        1.64
5         32                275           16            2.534        3.43
6         64                793           32            2.534        15.89
7         128               2315          64            2.534        34.28
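
The priority-driven behavior described in this subsection can be written down in a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and it additionally assumes that the two request events are independent in each step, which the text does not state. It enumerates the four input combinations and reports how often each grant is issued and how often a request is not served in the same step; it does not recompute the value reported in Table 1, which depends on the weights of the automata from Example 3.

```python
from itertools import product

P_R1, P_R2 = 0.4, 0.3  # P({r_1 = 1}) and P({r_2 = 1}) from the experiment


def priority_grant(r1: bool, r2: bool) -> int:
    """Grant Client 1 whenever it requests; otherwise grant Client 2."""
    return 1 if r1 else 2


def step_statistics():
    """Per-step grant and missed-request probabilities, assuming independent requests."""
    p_grant = {1: 0.0, 2: 0.0}
    p_missed = {1: 0.0, 2: 0.0}
    for r1, r2 in product([False, True], repeat=2):
        prob = (P_R1 if r1 else 1 - P_R1) * (P_R2 if r2 else 1 - P_R2)
        grant = priority_grant(r1, r2)
        p_grant[grant] += prob
        if r1 and grant != 1:
            p_missed[1] += prob
        if r2 and grant != 2:
            p_missed[2] += prob
    return p_grant, p_missed


p_grant, p_missed = step_statistics()
print(p_grant, p_missed)
```

Under these assumptions, a request from Client 1 is never missed, and a request from Client 2 is missed exactly when both clients request in the same step.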



5.2 Fair Controller. 

In the second experiment, we added response constraints to the safety specification.
The constraints are given as safety automata that require that every request is
granted within two steps. We added one automaton $C_i$ for each client $i$, and the
final qualitative specification was $A \times C_1 \times C_2$. The optimal machine the tool
constructs is System $M_2$ of Example 3 (Figure 5). System $M_2$ follows the request
sent if only a single request is sent. If both clients request simultaneously, it
alternates between $g_1$ and $g_2$. If none of the clients is requesting, it grants $g_1$. Recall
that systems $M_1$ and $M_2$ from Example 3 exhibit the same worst-case behavior, so
a synthesis approach based on optimizing the worst-case behavior would not be
able to construct $M_2$.

Table 2. Results for 2 to 4 clients with response constraints

Clients   States in A x B   States in G   States in M   Value of M
2         3                 11            3             1.850
3         34                156           16            2.329
4         125               557           125           2.366

5.3 General Controllers. 

We reran both experiments for several clients. Again, the quantitative specification
was the product of the $A_i$'s. We used a skewed probability distribution with
$P(\{r_n = 1\}) = 0.3$ and $P(\{r_i = 1\}) = P(\{r_{i+1} = 1\}) + 0.1$ for $1 \le i \le 6$, and the qualitative
specification required mutual exclusion. Table 1 shows in the first three columns
the number of clients, the size of the specification ($A \times B$), and the size of the
corresponding MDP ($G$). Columns 4 and 5 show the size and the value of the
resulting machine ($M$), respectively. The last column shows the time needed to
construct the system. The runs took between half a second and half a minute.
The systems generated in this experiment have an intrinsic priority: they
grant requests in order of probability, from largest to smallest. Table 2
shows the results when adding response constraints that require that every request
is granted within the next $n$ steps, where $n$ is the number of clients.
This experiment leads to quite intelligent systems, which give priority to the most
probable input request but slowly shift the priority cyclically to the next request
variable, so that any request is serviced within $n$ steps when there are $n$ clients.
Note that these systems are (as expected) quite a bit larger than the corresponding
priority-driven controllers.
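
As a worked example of the skewed distribution used in these runs, the short Python sketch below computes the request probabilities for $n$ clients from the recurrence above. The function name `request_probabilities` is hypothetical, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction
from typing import List


def request_probabilities(n: int) -> List[Fraction]:
    """Return [P(r_1 = 1), ..., P(r_n = 1)] with P(r_n = 1) = 0.3 and steps of 0.1."""
    probs = [Fraction(3, 10)]  # P(r_n = 1)
    for _ in range(n - 1):
        probs.append(probs[-1] + Fraction(1, 10))  # P(r_i = 1) = P(r_{i+1} = 1) + 0.1
    probs.reverse()
    return probs


for n in (2, 7):
    print(n, [float(p) for p in request_probabilities(n)])
```

For two clients this reproduces the probabilities 0.4 and 0.3 of the first experiment; for seven clients the most likely request has probability 0.9.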

6 Conclusions and Future Work 

In this paper we showed how to measure and synthesize systems under
probabilistic environment assumptions with respect to qualitative and quantitative
specifications. We considered the satisfaction of the qualitative specification with
probability 1 ($M \models_{\mu} \varphi$). Alternatively, we could have considered the satisfaction
of the qualitative specification with certainty ($M \models \varphi$). For safety specifications the
two notions coincide; however, they differ for parity specifications. A solution for
satisfying the parity specification with certainty while optimizing the mean-payoff
specification can be obtained similarly to the solution of mean-payoff parity games [12], by
replacing the solution of mean-payoff games with the solution of MDPs with mean-payoff
objectives. However, since solving MDPs with parity specifications for certainty is
equivalent to solving two-player parity games, and no polynomial-time algorithm is
known for parity games, the algorithmic solution for the satisfaction of the
qualitative specification with certainty is computationally expensive compared to the
polynomial-time algorithm for MDPs with mean-payoff parity objectives. Moreover,
under probabilistic assumptions, satisfaction with probability 1 is the natural
notion.

We have implemented our algorithm in the tool Quasy, a quantitative synthesis
tool for constructing worst-case and average-case optimal systems with respect
to a qualitative and a quantitative specification. We can check if an optimal
finite-state system exists and then construct either an optimal or an $\epsilon$-optimal system,
depending on the outcome of the check.

In our future work, we will explore different directions to improve the
performance of Quasy. In particular, a recent paper by Wimmer et al. [45] presents an
efficient technique for solving MDPs with mean-payoff objectives based on
combining symbolic and explicit computation. We will investigate if symbolic and explicit
computations can be combined for MDPs with mean-payoff parity objectives as
well.